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forces  on  N  point-masses  interacting  in  three-dimensional  space.  The  algo¬ 
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Chapter  1 


Introduction 


Numerical  simulations  of  N- body  interactions  are  extremely  important  in 
astrophysics,  plasma  physics,  fluid  dynamics,  and  molecular  dynamics.  To 
accomplish  such  a  simulation  by  directly  evaluating  the  interaction  potentials 
requires  computation  time  on  the  order  of  0(N 2),  which  is  prohibitively 
expensive  for  large  N.  Consequently,  there  have  been  many  efforts  to  develop 
algorithms  whose  complexity  grows  more  slowly  [Lec72].  These  algorithms 
all  make  assumptions  about  the  distribution  of  particles  to  approximate  the 
interaction  potentials,  and  thus  are  applicable  only  to  certain  classes  of  N- 
body  interactions. 

The  best-known  “fast”  algorithm  is  the  tree-code  [App85],  which  uses  a 
tree-like  data  structure  to  form  hierarchical  clusters  of  particles.  The  algo¬ 
rithm  approximates  the  force  exerted  by  a  cluster  on  a  distant  particle  as 
the  force  exerted  by  an  equivalent  mass  located  the  cluster’s  center  of  mass. 
The  overall  computational  complexity  of  this  algorithm  is  believed  to  be 
0{N  log  N).  However,  the  force  approximations  are  effective  when  the  par¬ 
ticle  distribution  is  relatively  nonuniform,  and  there  are  no  known  a  priori 
estimates  of  the  accuracy  of  these  approximations,  although  methods  have 
been  developed  for  use  in  practice  [Por85]  [BH86]. 

Rokhlin  and  Greengard  [GR86]  discovered  an  fV-body  algorithm  with 
complexity  0(N).  The  algorithm  separates  the  computation  of  the  far- field 
potentials  from  that  of  the  near-field  potentials,  and  expands  the  far-lield 
potentials  into  Taylor  series.  Sigmncantly,  given  a  required  precision  e,  one 
can  determine  in  advance  how  many  terms  must  be  retained  in  the  Taylor  se¬ 
ries  expansions,  and  thus  one  can  control  the  accuracy  of  the  approximation. 
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R.okhlin  and  Greengard’s  approach  relies  heavily  upon  analytical  techniques 
to  develop  the  series  expansions,  to  derive  error  estimates,  and  to  shift  se¬ 
ries  expansions  between  coordinate  frames.  For  the  fV-body  problem  in  two 
dimensions,  this  analysis  is  greatly  simplified  by  modeling  two-dimensional 
space  as  the  complex  plane  and  using  the  theory  of  functions  of  a  complex 
variable. 

This  thesis  applies  Rokhlin  and  Greengard’s  idea  to  develop  an  0{N ) 
algorithm  for  the  three-dimensional  JV-body  problem.  Although  there  is 
no  direct  analogue  of  complex  analysis  in  three  dimensions,  we  are  able  to 
exploit  the  harmonic  nature  of  the  gravitational  potential  in  order  to  produce 
suitable  series  expansions.  We  implement  the  algorithm  and  experimentally 
verify  its  accuracy  and  its  linear  growth.  We  also  implement  a  parallel  version 
of  the  algorithm  and  present  experimental  results. 

Chapter  2  of  the  thesis  reviews  Rokhlin  and  Greengard’s  algorithm  and 
their  use  of  complex  variables  for  the  two-dimensional  problem  and  points 
out  difficulties  in  applying  the  method  in  three  dimensions. 

Chapter  3  develops  a  three-dimensional  analogue  of  Rokhlin  and  Green¬ 
gard’s  method.  The  basic  idea  is  to  expand  the  potentials  in  terms  of  spher¬ 
ical  harmonics.  The  analytical  basis  of  the  algorithm  includes  two  theorems 
that  derive  the  required  expansions  and  give  bounds  on  the  error  terms. 
There  are  also  three  lemmas  that  describe  how  to  shift  these  expansions  from 
one  origin  to  another.  Using  these  analytical  results,  we  present  a  sequential 
algorithm  in  three-dimensions.  Independently  of  this  work,  a  similar  exten¬ 
sion  of  Rokhlin  and  Greengard’s  method  has  been  carried  out  by  Greengard 
[Gre87].  His  technique  differs  from  ours  in  that  he  uses  expansions  in  polar 
coordinates  and  we  work  in  Cartesian  coordinates. 

Chapter  4  concerns  numerical  verifications  of  the  three-dimensional  al¬ 
gorithm.  We  first  describe  a  sequential  implementation  of  the  algorithm. 
We  discuss  our  choice  of  a  tree-shaped  data  structure  and  describe  heuris¬ 
tic  methods  for  increasing  the  efficiency  of  the  implementation.  We  then 
present  experimental  results  to  demonstrate  that  the  algorithm  does  have 
linear  growth  and  does  compute  the  forces  and  potentials  to  within  any  pre¬ 
specified  tolerance  up  to  machine  precision.  We  compare  the  speed  of  the 
algorithm  with  that  of  an  0(N2)  direct  force  computation.  For  a  required 
average  accuracy  of  10-4  for  potential  fields,  our  algorithm  will  be  faster  than 
the  direct  force  computation  when  there  are  more  than  1,000  particles. 

Chapter  5  presents  an  extremely  fast  implementation  of  the  fV-body 
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code — a  parallel  implementation  of  the  algorithm,  which  runs  in  0(log  N) 
time,  implemented  on  the  Connection  Machine.  We  compare  the  experi¬ 
mental  results  with  that  of  the  sequential  implementation  and  find  that  the 
parallel  algorithm  has  a  speed-up  of  10  for  N  =  1,000  and  a  speed-up  of  100 
for  N  —  10,000.  For  a  large  parallel  machine,  interprocessor  communica¬ 
tion  is  often  the  bottleneck  of  the  computation.  By  exploiting  regularity  and 
locality  in  communication  patterns,  by  combining  and  delegating  messages, 
we  demonstrate  how  to  minimize  communication  overhead  due  to  message 
congestion. 

Throughout  this  thesis,  if  not  specified  otherwise,  N  will  always  refer  to 
the  total  number  of  particles  in  a  simulation,  and  p  will  always  refer  to  the 
highest  degree  of  harmonics  retained  in  an  expansion. 


Chapter  2 


The  TV-body  Algorithm  in  Two 
Dimensions 


This  chapter  follows  the  presentation  of  the  two-dimensional  algorithm  given 
by  Rokhlin  and  Greengard  in  [GR86].  We  review  the  major  theorems  and 
give  a  heuristic  description  of  the  algorithm.  The  key  idea  is  to  use  complex 
analysis  to  produce  a  series  expansion  of  the  potentials. 


2.1  The  Multipole  Expansion  Method 

In  the  two-dimensional  A-body  problem,  we  consider  point  charges  and 
Coulombic  forces.  For  a  charge  q  located  at  the  point  y  =  (2/1 , j/2)  £  R2, 
the  potential  at  a  point  x  =  (xl5x2)  is  g  log  [[  x  —  y  ||.  If  we  identify  R2  with 
the  complex  plane  C1  via  (xi,x2)  <-►  xi  +  zx2,  we  can  consider  the  potential 
to  be  the  real  part  of  the  analytic  function  <j>y  :  C1  — >  C 1 

<i>y(x)  =  glog(x  -  y ). 

We  can  expand  (t>y(x)  in  a  Taylor  series  that  converges  for  any  x  with 
|x|  >  |y|: 


<Mx)  =  ?log(x  -y)  =  q  log(x)  -(-  log  ^1  =q 


00  1  fy'k 


^(*) -E 


k=\ 


If  we  retain  only  p  terms  of  the  expansion,  the  error  is  bounded: 
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<f>y(x)-q\og{x)  +  qY^~(^  <  — 


<  9 

y  1 

"l- y- 

X 

X 

This  observation  leads  to  the  following  multipole  expansion :  Given  an 
ensemble  of  m  charges  {<?('), i  =  located  at  points  {3/^,1  = 

with  <  ro,  the  potential  <t>y^){x )  at  any  point  x  G  Cl  due  to  the  charge 
qM  at  j/W,  with  |x|  >  r0,  is 


°°  1 

<(>yi'){x)  =  q(l)  log(x)  ~  S  T V  k  ’  *  =  1,  (2-1) 

k  xk 

The  total  potential  at  the  point  x  £  C1  due  to  all  the  charges,  therefore,  is 


where 


m  oo 

<t>{x)  =  Z  4>y('){x)  =  Q  log(x)  +  Z  "T>  (2-2) 


™  "L  — 

<2  =  Z  9 '  and  a*  =  Z - E — ~ 

: — i  • 1  fC 


Moreover,  for  any  p  >  1, 

<f>(x)  -Q  log(x)  —  Z  “T  — 


p  a,  A  17*0  lp+1 


l  _  1°  I  x 


where 


-4  =  £  l?(i)l- 


In  particular,  if  |r*o/x|  <  1/2,  we  have 


p  i  p+i 

<f>{x)  -Q  log(x)  -  Z  —  ^  2A  9 

*=i  x  z 


I 


;w 


SOT 


i 
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To  obtain  an  approximation  accurate  to  within  a  given  precision  e,  it  suffices 
to  retain  only  the  first  p  terms  in  the  multipole  expansion  (2.2)  and  to  take 
p  to  be  of  the  order  [—  Iog2  ej . 

The  ability  to  sum  individual  expansions  of  (2.1)  to  obtain  the  multipole 
expansion  of  (2.2)  depends  on  the  fact  that  these  expansions  {<^y(i)(x),i  = 
l,...,m}  have  a  common  reference  point  and  a  common  region  of  conver¬ 
gence.  If  they  don’t,  we  need  to  shift  the  series  =  l,...,m}  to 

a  common  reference  point.  Also,  to  insure  that  all  the  shifted  series  have 
a  common  region  of  convergence,  we  may  need  to  “flip”  a  series  so  that  its 
region  of  convergence  lies  inside  a  disk,  rather  than  outside  a  disk.  Rokhlin 
and  Greengard  [GR86]  derive  shifting  and  flipping  lemmas  that  describe  how 
to  perform  these  operations  in  time  independent  of  the  total  number  of  par¬ 
ticles.  These  lemmas  allow  us  to  manipulate  the  multipole  expansions  in  a 
manner  required  by  the  following  fast  algorithm. 


2.2  Description  of  the  Algorithm 

Rokhlin  and  Greengard’s  O(N)  algorithm  computes  separately  the  potentials 
of  close- by  particles  (“near- field  potentials”)  and  the  potentials  of  far-away 
particles  (“far- field  potentials”).  The  near- field  potentials  are  computed  us¬ 
ing  direct  evaluation.  If  the  number  of  particles  near  any  given  particle  is 
bounded  by  a  small  constant,  that  is,  if  the  distribution  of  particles  is  rela¬ 
tively  uniform,  then  the  work  required  to  compute  the  near-field  potential  at 
this  particle  is  of  order  0(1).  The  fax-field  potential  is  obtained  by  evaluat¬ 
ing  the  p-term  multipole  expansion  described  above  at  the  particle  position, 
which  takes  constant  number  of  operations  for  a  prespecified  p.  The  com¬ 
putation  is  organized  so  that  the  total  amount  of  work  for  computing  the 
potentials  at  all  the  N  particles  is  of  order  0(N). 

More  precisely,  the  two-dimensional  space  under  consideration  is  regarded 
as  a  square  (Figure  2.1).  It  is  divided  into  four  subsquares,  and  each  of  which 
is  then  recursively  subdivided  into  four  sub-subsquares,  and  so  on.  This  de¬ 
composition  is  represented  using  a  tree-like  data  structure  in  which  each 
square  is  represented  by  a  node.  A  square  at  level  /  of  the  tree  has  four  child 
nodes  that  represent  the  subsquares  at  level  /  +  1.  This  recursive  decompo¬ 
sition  is  continued  until  there  are  only  several  particles  in  a  square  at  finest 
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level  1 


level  1+1 


Figure  2.1:  Decomposition  in  a  two-dimensional  space,  with  the  corresponding 
data  structure 


grain.  Under  the  assumption  that  the  particles  are  uniformly  distributed1, 
the  level  of  the  tree  n,  i.e.,  the  level  of  the  decomposition,  is  usually  chosen 
as  about  [log4  N J .  Therefore  each  square  at  the  finest  level  has  in  average 
one  or  two  particle  in  it. 

We  define  for  a  square  its  near-field ,  far-field,  and  interactive-field.  The 
near-field  of  a  square  consists  of  those  neighboring  squares  that  are  at  the 
same  level  of  decomposition  as  the  square.  In  Figure  2.2,  the  squares  labeled 
as  B  are  in  the  near-field  of  square  A.  The  far-field  of  a  square  is  defined  to  be 
the  exterior  area  of  its  near-field.  The  interactive-field  of  a  square  is  the  part 
of  its  far-field  that  is  in  the  near-field  of  the  square’s  parent.  In  Figure  2.2, 
the  squares  labeled  as  A'  are  in  the  interactive-field  for  the  square  A. 

The  goal  of  the  computation  is  to  compute  the  far-field  potential  expan¬ 
sions  at  all  particle  positions  with  0(N)  work.  This  is  achieved  by  propa¬ 
gating  values,  first  upwards  through  the  tree,  then  downwards,  as  follows: 

Initially,  for  each  of  the  squares  at  the  finest  level,  we  compute  the  p-term 
multipole  expansion,  valid  outside  the  square,  for  the  potential  due  to  the 
charges  inside  the  square.  The  expansion  is  performed  relative  to  the  center 
of  the  square. 
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Figure  2.2:  Near-field,  far-field,  and  interactive-field 

These  expansions  are  propagated  upwards  through  the  tree  to  produce, 
for  every  square  at  every  level,  a  multipole  expansion  for  the  potential  due 
to  all  charges  located  inside  the  square.  The  valid  region  of  convergence  for 
the  multipole  expansion  is  the  far-field  of  the  square.  To  form  the  multipole 
expansion  for  a  square  at  level  /,  we  take  each  of  its  level- /+1  subsquares,  shift 
their  multipole  expansions  to  the  center  of  the  level-/  square  and  add  them 
together.  This  is  done  recursively  at  each  level  of  the  tree  while  propagating 
upwards  through  the  tree. 

The  downward-propagation  phase  of  the  computation  produces,  for  every 
square,  a  local  potential  expansion  due  to  its  far-field  interactions.  The 
local  expansion  for  a  square  is  a  multipole  expansion  that  is  valid  inside  the 
square.  The  computation  proceeds  recursively.  Suppose  we  already  have 
a  local  potential  expansion  <f>'Ap  (x)  for  square  A' s  parent  AP.  The  local 
expansion  <t>'A{x)  for  the  square  A  is,  by  the  definition  of  an  interactive-field, 
the  sum  of  4>'ap{x)  and  those  multipole  expansions  due  to  particles  in  A’s 
interactive-field.  However,  4>'ap{x )  is  relative  to  the  center  of  Ap.  In  order 
to  combine  it  with  other  multipole  expansions  for  the  square  A,  we  shift 
to  the  center  of  A.  We  have,  from  the  upward-propagation  phase  of 
the  computation,  the  multipole  expansion  <j>A'{x)  for  each  square  A!  in  A’s 
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interactive- field.  In  order  to  combine  <j>A'{x),  we  shift  them  to  the  center 
of  A ,  and  flip  regions  of  convergence  so  that  they  have  a  common  reference 
point  and  are  valid  inside  the  square  A.  The  sum  of  the  shifted  <f>'Ap{x)  and 
4>a’{x)  is  the  local  expansion  <j>'A(x)  for  the  square  A.  This  computation  is 
performed  recursively  all  the  way  down  to  the  leaves  of  the  tree. 

Now  we  have  for  every  square  at  the  finest  decomposition  level  the  local 
expansion  due  to  its  far-field  valid  in  the  square.  To  get  the  far-field  po¬ 
tential  at  a  given  particle,  we  have  only  to  evaluate  the  local  expansion  at 
the  particle’s  position.  The  near-field  potential  is  obtained  by  evaluating  di¬ 
rectly  interactions  with  particles  in  the  near-field.  The  sum  of  the  near-field 
potential  and  the  far-field  potential  is  the  desired  potential  at  the  particle. 

Rokhlin  and  Greengard  [GR86]  show  that  the  total  work  required  is 
N(ap 2  +  bp  -f  ckn  +  d),  where  kn  is  the  upper-bound  for  the  number  of  par¬ 
ticles  in  a  square  at  the  finest  level.  The  key  observation  in  this  estimate 
is  that  each  shifting  or  flipping  of  a  p-term  multipole  expansion  takes  0(p2) 
work,  and  that  the  number  of  squares  in  a  given  square’s  interactive-field  is 
bounded  by  27.  The  total  number  of  squares  at  all  levels  is 


4°  +  41  +  •  •  •  +  4"  = 


4"+i  _  i 


41V-  1 
3 


Therefore  the  upward-propagation  phase  and  the  downward-propagation 
phase  of  the  algorithm  accounts  for  the  p2  term  in  the  estimate.  The  initial 
expansion  step  and  the  final  evaluation  step  in  the  algorithm  is  responsible 
for  the  p  term,  and  the  direct  computation  on  near-field  potentials  is  for  the 
kn  term. 

The  overall  computational  complexity  of  the  algorithm,  from  the  above 
estimate,  is  of  order  O(N). 

Rokhlin  and  Greengard  [GR86]  also  derive  that  the  error  estimate  in  the 
computation  is  C(l/2)p+1.  Given  a  precision  requirement  e,  the  number  of 
terms  p  retained  in  the  expansion  is  set  as  [_—  log2  ej . 

2.3  Remarks 

The  key  to  the  two-dimensional  O(N)  algorithm  is  that,  for  the  potential  <j>{x) 
due  to  charges  at  =  l,...,m},  we  evaluate  the  potential  at  each  of  the 

points  {x^\j  =  l,...,n}  by  applying  the  multipole  expansion  of  <f>(x),  rather 
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than  by  directly  evaluating  and  summing  the  individual  potentials  <£y(,)(x)  at 
each  Our  ability  to  apply  this  method  relies  on  the  fact  that  the  Taylor 
series  expansions  of  the  =  1, m}  are  absolutely  convergent  power 

series  whose  coefficients  are  independent  of  x.  We  produce  the  expansion  of 
<t> (r)  by  summing  corresponding  coefficients  of  the  expansions  <f>y^)(x).  In 
order  to  do  so,  these  expansions  must  have  a  common  reference  point  and  a 
common  region  of  validity,  or  they  can  be  shifted  so  that  they  do  so. 

The  applicability  of  complex-analytic  techniques  in  the  two-dimensional 
case  results  from  the  fact  that  the  two-dimensional  potential,  viewed  as  a  real¬ 
valued  function  on  R 2,  is  harmonic,  and  thus  is  the  real  paxt  of  a  complex- 
analytic  function  on  C.  In  three  dimensions,  the  potential  function  is  likewise 
harmonic.  Unfortunately,  there  are  no  corresponding  complex-analytic  tech¬ 
niques  in  three  dimensions  to  simplify  the  analysis. 

In  the  next  chapter,  we  will  produce  expansions  of  the  three-dimensional 
potential  that  are  suitable  to  support  the  multipole  expansion  method. 


Chapter  3 


A  three-dimensional  iV-body 
algorithm 


We  consider  now  the  three-dimensional  yV-body  problem,  where  the  gravita¬ 
tional  potential  at  a  point  x  €  R3  due  to  a  point  mass  y  located  at  y  G  R3 
is 


Ux)~  ii i -sir  (3-1) 

Coulombic  potential  due  to  a  point  charge  q  has  the  same  formula,  except 
that  y  is  replaced  by  — q  [Gol59]  [Arn78j. 

In  order  to  apply  the  multipole  expansion  method,  we  must  expand  this 
potential  as  an  absolutely  convergent  series  whose  coefficients  are  indepen¬ 
dent  of  x,  and  we  must  produce  an  a  priori  bound  on  the  error  when  we 
retain  only  a  finite  number  of  terms  in  the  series.  We  must  also  be  able  to 
shift  the  series  expansion  to  a  new  origin. 

The  potential  function  in  (3.1)  is  harmonic  in  everywhere  other  than  y 
[Kos64].  This  suggests  the  possibility  of  expanding  <f>y(x )  in  terms  of  spherical 
harmonics  [Hob55]. 

In  this  chapter  we  prove  two  theorems  that  derive  the  required  expan¬ 
sions  and  give  bounds  on  the  error  terms.  There  are  also  three  lemmas  that 
describe  how  to  shift  these  expansions  from  one  origin  to  another.  Using 
these  analytical  results,  we  present  an  analogue  of  Rokhlin  and  Greengard’s 
algorithm  in  three  dimensions. 

At  the  same  time  the  author  derived  the  results  [Zha87],  Greengard  also 
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extended  the  two-dimensional  algorithm  to  three  dimensions  [Gre87].  His 
method  is  similar  to  the  one  given  here,  but  the  underlying  theorems  and  ex¬ 
pansions  are  different.  Greengard  works  in  terms  of  polar  coordinates,  while 
the  formulas  and  derivations  here  are  done  in  terms  of  Cartesian  coordinates. 

In  Chapter  4,  we  present  experimental  results  that  demonstrate  the  er¬ 
ror  bound  and  order-of-growth  estimates  given  here.  However,  there  are  no 
comparable  published  experimental  results  for  the  polar-coordinate  form  of 
the  algorithm.  It  is  unknown  how  the  two  different  forms  of  the  algorithm 
compare  in  practical  implementations. 


3.1  The  Multipole  Expansion  in  Three  Di¬ 
mensions 

Given  an  ensemble  of  particle  masses  =  l,...,m}  located  at  points 

{yh1)  £  R3,n  =  l,...,m},  the  gravitational  potential  at  any  given  point 
x  €  R3  is 


*(*)  =  £ 

n= 1 

Here  =||  x  —  j/h»)  ||, 
absolutely  convergent  series 


x  — 


_  ™  -Vn> 

r(n) 


(3.2) 


n=l 


We  will  show  how  to  expand  l/r^nl  into  an 


~  X>**(y(  n  —  mi  (3-3) 

r  ijk 

where  a,jk(y W)  is  independent  of  x  and  depends  only  on  y^.  This  will 
enable  us  to  get  the  multipole  expansion  for  the  potential  $(:r)  due  to  all  the 
masses  {/^n)}  by  summing  together  the  aijfc(y^nl)’s  weighted  by  the  masses 
fi^\  ...,  respectively: 


*(*)  =  E  E  [S'")  w 

ijk  \n=l 


(3.4) 


The  force  field  is  obtained  by  taking  the  gradient  of  the  potential  field 


F  =  —  V$(x) 


!« 
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«  £  ( £  (V*1)  “«*(»(”)) )  (-ve«*(*)) 

ijfc  \n=l  / 

=  £(x;(V"OW"i)W/iM-  (3-5) 

ijfc  \n=l  / 

The  series  expansion  of  (3.5)  has  the  same  coefficients  as  that  of  (3.4),  only 
this  time,  0,y*(ar)  is  replaced  by  Q'ijk(x).  This  enables  us  to  use  the  same 
expansion  formulas  for  potentials  derived  in  the  next  section  to  compute 
multipole  expansions  for  forces. 

3.2  Theorems 

This  section  derives  necessary  formulas  for  producing  and  shifting  multipole 
expansions  in  three  dimensions,  together  with  bounds  on  the  error  when 
retaining  only  a  finite  number  of  terms  in  the  expansions. 


Figure  3.1:  Gravitational  field 

Figure  3.1  illustrates  the  quantities  needed  in  our  derivation.  The  poten¬ 
tial  at  point  x  €  R3  is  due  to  mass  fi  at  point  y  6  R3.  is  the  reference 
point,  and  j  is  the  angle  between  vectors  y^y  and  y^x.  Also  there  are 
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quantities  r  =||  x  —  y  ||,  r<j  =||  y  —  ||,  and  R  =||  x  —  ||.  We  need  the 

following  relationships  later  in  deriving  the  multipole  expansion  formula. 

From  the  basic  triangle  relationship,  we  have 

r  =  \J R2  +  —  2 Rr0  cos  7.  (3.6) 

We  derive  the  following  theorem  which  produces  an  expansion  for  1  /r  in 
a  region  of  analyticity  outside  a  sphere.  This  is  essential  in  obtaining  the 
multipole  expansion  (3.4). 


Theorem  3.2.1  Using  the  quantities  as  shown  in  Figure  3.1,  if  R  >  r0, 
that  is,  if  x  —  (xi, Xi,x^)  measured  relative  to  y^  is  outside  the  sphere  of 
radius  r0  centered  at  y^Q\  then 


1  «  di+1+k 

r  ~  .^L0aijkdx[dxidxk 


where 


i+j+k 

—  (  —  \  y+j+k  'fo _  i  j  k 

~  Hj!k!  12  3’ 


avk  =  (-!)’ 


(3.7) 


(3.8) 


and  V\,V2,V3  are  the  direction  cosines  of  y(°)j/. 

If  we  retain  only  terms  ofi+j  +  k<p  then  the  error  satisfies 


where  C  is  independent  of  p. 


(3.9) 


Proof. 

In  Figure  3.1,  the  Cartesian  coordinate  distance  between  the  points  x 
and  y  is 


r(x,y)  =||  x  -  y  ||  = 


(3.10) 
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We  expand  l/r(x,y)  as  a  Taylor  series  in  powers  of  (yg  —  y^)  and  get 


1 


r(x,y)  r(x, y) 


OO  1 

!,=y0»  0=1  a' 


8 


L/3=i  dyp 


r{x,y) 


y=yW 


(3.11) 


The  subscript  y  =  y <0)  following  the  brace  indicates  that  after  the  differenti¬ 
ation  is  done  yi,y?,y3  of  point  y  should  be  replaced  by  y^\  y2°\  y^  of  point 
yi0) .  When  x  lies  outside  the  sphere  of  radius  ro  centered  at  y(°\  the  above 
Taylor  series  converges  absolutely  and  uniformly,  we  make  the  substitution 
y  =  y*0)  before  the  differentiation  is  performed.  Using  the  identities 


d 

f  1  ) 

1-  5  I 

f  ‘  \ 

dye  ' 

Kr(*,y)) 

dig  ' 

{r(x,y)J 

(3  =  1,2,3 


and 


1 


r(x,y) 


=y(°  > 


R 


the  Taylor  series  (3.11)  becomes 

1  1  (-1)“  o  ‘  3 


r(x,y)  R 


=  i  +  E 


ye  -  vT 


ar=l 


a! 


0  \£  r0  dxg 


(3.12) 


Let  I/#  =  (yg  —  y^)  /r0.  Since  vx,v?,vz  are  the  direction  cosines  of  y(°Ly,  it 
follows  that 


d 


£  ^ dxg)  ,+£=a  i\]\k\U'U2l/z  dx\dx32dx\ '  (3'13) 


da 


Substitute  this  into  (3.12),  we  get 
1 


r(x’y)  ,Jko  1 


i+j+kr  o 


o+J+"  ,  j  k  d'+3+k 
i'.j'.k'.  1/11/21/3  dx\8x32dxk3  (nJ  ' 


(3.14) 


This  completes  the  proof  for  (3.7). 

In  order  to  derive  the  error  bound  of  (3.9),  we  will  use  the  fact  that  1/r 
can  be  expanded  in  powers  of  R  using  the  Legendre  Polynomials  [And85]. 
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1  _ 
r 


E-^tp*(cos^)  if 

a=0  rl 


<  1 


if 


(3.15) 


v  a=0  '  0 


>  1 


The  polynomials  Pa(x)  are  called  the  Legendre  Polynomials 


po(s)  =  l,  Pi(x)  =  x,  P2(x)  =  i(3x2  -  1),  ••• 

The  series  of  (3.15)  is  absolutely  convergent.  Notice  that  the  Legendre 
Polynomials  Pa(cos7)  vary  with  the  directions  of  the  vectors  y(°~>x  and  y^y, 
and  thus  is  a  function  of  the  point  x  at  which  the  potential  is  to  be  evaluated, 
as  well  as  a  function  of  the  mass  location  y. 

Now  we  can  complete  the  proof  of  (3.9).  Using  the  notation 


a 


the  series  of  (3.12)  becomes 


(-!)“«  d°  A  1 


r(x,y)  R  + ~T i  a!  r°  drg 


s) 


(3.16) 


Notice  that  there  is  a  term-by-term  correspondence  between  this  series  and 
the  series  of  (3.15).  By  equating  terms,  we  have 


ua+l  fla  /  1  \ 

a(coS7)  =  (-ir— ^(-). 


(3.17) 


fta+l  Qa  j 

Surprisingly,  — T'o-S'('n^  only  depends  on  7  ,  not  R. 
a;!  otq  R' 

Using  the  correspondence  between  (3.16)  and  (3.15),  we  have 


1 

r 


«+j+fc<p 

£ 


Qi+j+k 

v k  dx\ dx^dx* 
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Furthermore, 


aijk  =  £  (- fi(n)aijk )  ■ 


i+j+k<p  Qi+j+k 

~  ^  a'lk  A-yi  A 


where 


ij,k  dx\dx:2dxk3  VII  x-y(°) 


c  =  £  c/i(n). 


x  —  y(°) 


(3.19) 


The  following  theorem  produces  an  expansion  for  1  /r  in  a  spherical  region 
of  analyticity,  which  is  the  basis  for  flipping  the  region  of  analyticity  in 
shiftings,  as  required  by  the  fast  algorithm. 

Theorem  3.2.2  Using  the  quantities  in  Figure  3.1  again,  if  R  <  r0,  that  is, 
if  x  —  (xi,X2,x3)  measured  relative  to  lies  inside  the  sphere  of  radius  r0 
centered  at  y(°f  then 

1  <» 

~  ~  1X2^3  (3.20) 


where 


i,j,k= 0 


('  —  iV+i+fc  L*J  t^J  UJ 

=  ~T+J+k+i  'L'LY.ab 

r0  7=0  0=0  a=0 


(3.21) 


A  = 


2  )  O'  +  J  +  k  ~  Q  ~  ft  ~  7)! 

i  +  j  +  k  -  a  —  0  —  7 J  (i  —  a)!(j  —  f3)l(k  —  7)! 


here  i/\ ,1/2,  ^3  ace  the  direction  cosines  ofy(°)y. 

The  error  bound  for  truncating  terms  of  i  +  j  +  k  >  p  satisfies 

R  P+1 

M  <  c  - 

Co 


(3.22) 


/  «* 
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Proof. 

Using  the  relationship  in  (3.6),  we  have 


1 

r 


_ 1 _ 

\JR?  +  Tq  —  2  Rr0  cos  7 
1 


cos  7 


(3.23) 


Denote  by  t1,t2,t3  the  direction  cosines  of  y(°)x,  and  ^3  the  direction 

cosines  of  y(°)y.  Then 


cos  7  =  T1U1  +  T2V7  +  r3i/3 


where  7  is  the  angle  between  y(°)x  and  y(°)y.  This  leads  to  the  equality 


(  R\2  _ _x\  +  x\  +  x\  ^XiUi  +  X2V2  +  X3I/3 

I  1  Z  -  COS  "I  —  2  —  Z - . 


\r0/ 


r0 


r0 


If  we  introduce  the  variables 


Xi  x2 
h  =  — ,  £2  =  — ,<3 
r0 


z3 


r0 


,  and 


a  —  t\{t\  —  2 17),  f3  —  t2(^2  —  2^2)1 7  =  ^3(^3  —  2i/3), 
equation  (3.23)  becomes 

1  1  1 


r  r°  ^1  4-  (a  +  0  +  7) 
We  can  expand  this  as  a  Taylor  series 


1  1  00 

-  =  -E 

r  r»n=0\n 


(a  +  /3  +  7)", 


that  is  valid  for  |a  +  /?  +  7)  <  1.  But 


(a  +  /?  +  7)n  =  £ 

i+j+k=n 


(3.24) 


(3.25) 
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thus 


I  =  1  V 
r  “  n>t.j£0\>  + 


\  (»  +  i +  fc)i, 
j  +  Jfe/  z!j!ik!  M 


7 


=  y  £ 

r°  «,i,A:=0 


(3.26) 


where 


= 


(i  +  j  +  k)\ 


ijk  \i+j  +  k)  ’ 

Substitute  a,  ft,  7  with  (3.24),  the  Taylor  series  of  (3.26)  becomes 


(3.27) 


where 


LtJ  UJ  UJ 

Kjk  =  (-l)’+'7  +  fc  £  £  6(.-Or),(j-X?),(fc — y) 

T=0  0—0  Or=0 

"  °^(3  ~  ^  ~  7)  (2u1)i-2a(2v2)j~20(2u3)k-^. 


a 


Write  (3.27)  exphcitly  in  powers  of  xi,x2,  and  x3 


1 


r  ”  S  bijkx\xJ2xk 


i,j,k= 0 


where 


l  & 

,J<:  ~  J+j+fc+i  • 
ro 


As  in  Theorem  3.2.1,  the  error  bound  for  truncating  terms  of  i+j  +  A:  >  p 
in  (3.20)  satisfies 

|i?|p+1 


kl  <  C 


r0 


In  the  following  lemma,  we  produce  a  formula  which  shifts  the  center  of 
a  multipole  expansion  valid  in  a  region  of  analyticity  outside  a  sphere. 
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Figure  3.2:  Lemma  3.2.1 


Lemma  3.2.1  Given  a  multipole  expansion 

Qaa,h  dxxdxldxl 


(i). 


(3.28) 


with  respect  to  y(°\  valid  in  the  region  outside  the  sphere  of  radius  ro  centered 
at  y(°\  suppose  we  shift  the  reference  point  t/°)  to  y'^  with  |j  y(°)  —  y'(°)  ||  = 
A r,  as  shown  in  Figure  3.2.  Note  that  R1  =||  x  -  y '(0)  ||.  The  new  multipole 
expansion  of  (3.28),  with  respect  to  y'(°\  in  the  region  outside  the  sphere  of 
radius  ro  +  Ar  centered  at  y'^  is 


where 


00  ^t+i+fc 

$m  =  T,  ciik — - — , — 
i,tk=o  dXldXidX* 


Cijk  =  EEE  O’a0-rai-a,j-g,k--i- 

a=0  0=0  -y=0 


(3.29) 


(3.30) 


Here  X  =  (X\,  X2,  X3)  are  the  Cartesian  coordinates  of  the  point  x  relative 
to  y'^°\  and  a\-k  are  the  coefficients  in  the  expansion  for  1 IR  with  respect  to 
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The  error  bound  for  retaining  terms  of  i  +  j  +  k  <  p  in  (3.29)  satisfies 


M<c 


+  Ar|p+1 


(3.31) 


Proof. 


Using  Theorem  3.2.1,  we  expand  l/R,  with  respect  to  y'^°\  in  the  region 
outside  the  sphere  of  radius  Ar  centered  at  y W  as 


I  =  V  '  di+i+k  ( L) 

R  ,j^l0  a'jk  dX{ dX{ dX%  Kw)’ 


(3.32) 


where 


a '  —  /  +J+fcr*rjrfc 

aijk-\  Tir2r3- 


TiiT2,T3  are  direction  cosines  of  y'(°)y(°).  Notice  that  after  the  shifting  there 


Xi  -  Xi  =  y(0>  -  y 


(°)  _  ,/(°) 


i  =  1,2,3 


Therefore 


d  _  d 

dxi  ~  dXi 

Substitute  (3.32)  into  (3.28),  we  have 


d°+<3+^  T  ~  #  Qi+j+k  /  i  n‘ 

rfdxtdxZ  n°ijk dxidxidx*  \R?)  ’ 


$(x)  «]h=o aafh dx^dx02dxl  dx\dxidx$ 


that  is 


00  «=  Qa+0+-t+i+j+k  /  1  \ 

^{X)  =  *jb=o „£o  a°3'<a'jk dX(+'dX0+Jex;+k  VRf)  ’ 


which  is  vahd  outside  the  sphere  of  radius  r0  +  Ar  centered  at  y'^0'1 .  Make 
substitutions  of  a  +  i  =  i',  j3  +  j  =  j’,  7  +  k  =  k1,  we  get 


$(*)  =  V  cvvv - ”7 - - - (~) 

dXl'dXl  dX?  \R'J 


1% 


p 

|w 

m 


tows© 


w 
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where 


i'  j'  k' 
Ci'j'k'  =  SEE 

a=0  0=0  ->=0 


So  fax  we  have  proved  the  expansion  (3.29). 

Since  the  multipole  expansion  of  (3.29)  is  unique  in  terms  of  the  deriva¬ 
tives  of  1  /R1,  we  could  estimate  the  error  bound  for  truncation  in  (3.29)  as 
if  we  expanded  it  directly  with  respect  to  y'(°\  From  (3.19)  we  know  that  is 


M  <  C' 


ro  +  A  r 


R! 


p+ 1 


We  now  show  how  to  convert  a  multipole  expansion  into  a  local  expansion 
valid  within  a  spherical  region  of  analyticity,  by  shifting  the  reference  point. 
The  region  of  analyticity  is  flipped. 


Figure  3.3:  Lemma  3.2.2 


Lemma  3.2.2  Given  a  multipole  expansion 

oo  Qa+0+~t 

$(X)  =  aa^Qxad  0Q  1 

a}0,-r=O  UXlUX2UX3 


s)- 


(3.33) 
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with  respect  to  y(°\  valid  in  the  region  outside  the  sphere  of  radius  ro  centered 
at  y^°\  suppose  this  time  the  reference  point  y is  shifted  A r  to  y with 
A r  >  ro,  as  shown  in  Figure  3.3.  The  new  multipole  expansion  of  (3.33), 
with  respect  to  y^°\  this  time  inside  a  sphere  of  radius  (A r  —  ro)  centered 
at  y'(°}  is 


i(X)  ='£cijkxlxixl 

ijk 


(3.34) 


where 


Ctjk  —  .j,  ,  aa0-t^>i+a,i+0,k+~f{}  +  a)‘(j  "I"  A  T)'  (3.35) 

a, 0,i=O 


Again  X  =  (Xi,  X2,Xf)  are  the  coordinates  of  the  point  x  relative  to 
bijk  are  the  coefficients  in  the  expansions  for  l/R  with  respect  to  y^°K 
The  error  bound  for  retaining  terms  of  i  +  j  +  k  <  p  in  (3.3 f)  is 


Ul<c 


Ar  —  ro 


(3.36) 


Here  R'  =||  x  —  y'^ 


Proof. 

From  Theorem  3.2.2  we  can  expand  l/R,  with  respect  to  y'^°\  inside  a 
spherical  region  with  radius  ( Ar  —  r0)  and  center  y1^  as 


1  00 

o=  £  hkXixixi 

n  i,j,k= 0 


(3.37) 


where 


bi,k  = 


y+j+fc  L*  J  UJ  UJ 

„»+i+fc+i 


(3.38) 


1=0  0=0  a= 0 


A  = 


B  = 


-J  \(i+j  +  k-a-0-7)\ 

i+j  +  k  —  a  —  0  —  -y)  (i  —  a)!^’  —  /5)!(A:  —  7)! 

'  ”  “)  (; '/)  (*  - 7)  (2.)‘-(2.r»(2,3)‘- 
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Ti ,  r2 ,  t3  are  direction  cosines  of  yd°)y(°).  Substitute  (3.37)  into  (3.33) 


00  Qa-W3-)-y  /  00  . 

a,(3,y=0  dxxdx2dx3  > 


OO  OO  OO 


a,/J,Tf= 0  i=o  ji=/9  A=-y 

z !  j1!  A: ! 


—0  \rk—y 


(i  —  a)!(j  —  ft)!(k  —  7)! 


(3.39) 


The  region  of  analyticity  is  the  inside  of  the  sphere  with  radius  (Ar  —  ro)  and 
center  at  j/'W .  After  making  substitutions  of  i  —  a  =  i' ,  j  —  /3  =  j' ,  A;  —  7  =  fe' , 
we  get 

*(*)  =  x  Ci.j.k.X'iXi'X?  (3.40) 

i',j',k'=0 

where 


Ci'j'fc'  = 


*  '7  -K  ■  a, 0,1=0 


X  ao/3'y^i'+a,j'+/3,fc,+'y(*,  +  a)K/  +  0)Kk'  +  7)"  (3.41) 


This  completes  the  proof  of  (3.34). 

Similar  to  Lemma  3.2.1,  the  error  bound  for  retaining  only  terms  of  i  + 
j  +  k  <  p  in  (3.34)  is 


|e|  <  a 


Bf 


p+i 


A? —  r0 


We  also  derive  a  formula  for  shifting  an  local  expansion  within  a  spherical 
region  of  analyticity. 

Lemma  3.2.3  Given  a  local  expansion 

OO 

$(x)  =  djkxlx2x 3, 

i,j,k= 0 


with  respect  to  y^°\  valid  in  the  region  inside  the  sphere  of  radius  r0  centered 
at  y(°\  we  expand  $(x)  inside  the  sphere  of  radius  r0  —  Ar  centered  at  y'^°\ 
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Figure  3.4:  Lemma  3.2.3 

after  yW  is  shifted  A r  to  y'W  with  A r  <  r0  as  shown  in  Figure  3. 4,  as 

*(*)=  £ 

i,j,k=0 


with  X  =  (Ai ,  X2,  X3)  the  new  coordinates  of  the  point  x  with  respect  to  y,(-°\ 
and 


iijk  —■  ci+a,j+/3,fc+'y 

a,0,~t=O 


i+aa)(!V){kV)Ax'^A*  (3-42) 


Here  y'^  —  =  (Axi,  Ax2,  Ax3). 


Proof. 

This  can  be  derived  by  re-expanding  the  expansion 

OO 

*(X)=  Y,  cijk(X1  +  Ax1)i(X2  +  Ax2y(X3  +  Ax3)k 

i,j,k=0 

in  powers  of  Xi,X2,X3.  No  truncation  errors  are  introduced  in  the  calcula¬ 
tion  of  (3.42). 


CHAPTER  3.  A  3-D  N-BODY  ALGORITHM 


27 


O' 


Having  derived  the  potential  in  the  form  of  the  multipole  expansion 

GO 

$(*)  =  5Z  cijkx\xJ2xk, 

i,j,k= 0 

we  recall  from  the  equation  (3.5)  of  Section  3.1  that  the  force  is  obtained  by 


where 


Therefore, 


F  =  -V$(s), 


d  8  d  \ 

dxi ’  dx2 ’  dx3 )  ' 


F  —  y  )  tCf IjkXi  X2X3,  y  (  jCijkX^X 2  x3,  )  ^  kCijkX1X2 


i  H  ~k—l 


i,j,k= 0 


i,j,k=0 


i,j,k= 0 


3.3  A  Sequential  Algorithm 

In  this  section  we  give  a  complete  description  of  the  three-dimensional  N- 
body  algorithm.  This  algorithm  is  similar  to  the  two-dimensional  algorithm 
of  Rokhlin  and  Greengard  described  in  Chapter  2,  however  this  time,  we  use 
the  analytical  results  obtained  in  the  previous  section.  This  formulation  of 
the  algorithm  is  for  a  sequential  computer.  We  present  a  parallel  version  in 
Chapter  5. 

The  three-dimensional  space  under  consideration  is  taken  to  be  a  cube 
(Figure  3.5).  We  apply  the  operation  of  subdividing  a  cube  into  eight  identi¬ 
cal  subcubes  repeatedly,  until  a  subcube  has  only  a  few  particles  in  it.  When 
the  particle  distribution  is  relatively  uniform,  the  level  of  subdivision  is  ap¬ 
proximately  n  =  [log8  iVj .  By  convention,  we  say  that  the  initial  cube  is 
at  level  0,  and  the  atomic  cubes ,  i.e.,  cubes  at  the  finest  level  of  the  spatial 
refinement,  are  at  level  n.  As  in  the  two-dimensional  algorithm,  we  define  for 
a  cube  its  near-field,  far-field  and  interactive-field.  The  near- field  of  a  cube 
is  defined  as  consisting  of  those  cubes  that  are  at  the  same  level  as  the  cube, 
and  have  distance  to  the  center  of  the  cube  less  than  r/3  of  the  cube  edge 
length;  the  far-field  of  the  cube  is  the  complement  of  its  near-field  and  itself; 
and  the  interactive-field  of  the  cube  is  the  part  of  its  far-field  contained  in 
its  parent’s  near-field. 


y.AAim  yyyssss. 
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Figure  3.5:  Decomposition  in  a  three-dimensional  space 

For  a  cube  i  at  level  /,  we  denote  by  <t>[(x)  the  multipole  expansion,  valid 
in  the  far-field  of  the  cube,  for  the  potential  due  to  particles  in  the  cube;  and 
by  the  local  expansion,  valid  inside  the  cube,  for  the  potential  due  to 

those  particles  in  its  far- field.  Both  <f>\(x)  and  ^[{x)  are  expanded  relative  to 
the  center  of  the  cube. 

A  description  of  the  algorithm  is  given  in  Figure  3.6.  n  is  the  total  levels 
of  spatial  refinement. 

Step  (1)  computes  initially  the  multipole  expansion  <j>?(x)  for  each  atomic 
cube.  This  is  obtained  by  adding  together  series  expansions,  relative  to  the 
center  of  the  cube,  for  potentials  due  to  individual  particles  in  the  cube. 

Step  (2)  computes  the  multipole  expansion  cf>{ x)  for  each  of  the  cubes  at 
all  intermediate  levels  in  an  upward  manner.  For  a  cube  i  at  level  l,  the 
multipole  expansion  <t>[(x)  is  computed  by  summing  <f>l+1(x)  of  cube  i's  eight 
subcubes  that  are  shifted  to  the  center  of  the  cube  i. 

Step  (3)  computes  the  local  expansion  il>(x)  for  each  of  the  cubes.  For 
a  cube  i  at  level  /,  the  computation  consists  of  two  parts.  First  it  shifts 
il>l~l(x)  of  cube  Fs  parent  to  the  center  of  cube  i,  which  constitutes  the  far- 
field  interaction.  Then,  it  computes  interaction  with  its  interactive-field  by 
summing  <f>(x)  of  the  interactive-field  that  are  shifted  to  the  center  of  cube  i. 
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The  sum  of  these  two  parts  is  the  local  expansion  t/>-(x)  for  the  cube.  This 
is  recursively  performed  when  walking  down  the  refinement  tree. 

Finally  step  (4)  computes  the  desired  potential  for  each  of  the  particles. 
The  far-field  potential  is  obtained  by  evaluating  i/>(x)  of  the  atomic  cube 
the  particle  belongs  to  at  the  particle  position.  The  near-field  potential  is 
obtained  by  a  direct  computation. 

The  error  estimate  for  the  algorithm  is  given  by  (3.31)  and  (3.36).  By 
the  definition  of  the  near-field  and  far-field  for  each  of  the  cubes,  we  have 
(ro  4-  A r)/R'  <1/2  in  (3.31)  and  R'/(Ar  —  r0)  <1/2  in  (3.36).  For  a 
required  accuracy  of  £,  we  choose  p  =  [—  log2  ej . 

Let’s  analyze  the  computational  complexity  of  the  algorithm.  The  total 
number  of  the  cubes  at  all  levels  of  the  subdivision  is 


8°  +  81  +  •  ■  •  +  8n  = 


8n+1  -  1 


8JV-  1 
7 


Step  (1)  takes  O(N)  work  to  compute  N  series  expansions  for  all  the  N 
particles.  Summing  these  expansions  to  form  d>(x)  for  every  atomic  cube 
takes  another  O(N)  work. 

In  step  (2),  it  takes  one  shifting  and  seven  summations  of  expansions  to 
compute  4>(x)  for  each  of  the  cubes.  This  takes  constant  amount  of  work  for 
a  prespecified  p.  Since  the  total  number  of  the  cubes  is  of  order  N,  the  work 
to  compute  4>{x)  for  all  the  cubes  in  step  (2)  is  of  order  O(N). 

In  order  to  compute  ip(x)  for  each  of  the  cubes  in  step  (3),  we  need  to  do 
one  shifting  on  the  parent  tp(x)  and  one  shifting  for  each  of  the  cube  in  its 
near-field,  and  then  sum  the  resulting  expansions.  Since  the  number  of  cubes 
in  each  cube’s  interactive-field  is  bounded  by  567,  it  takes  a  constant  amount 
of  work  to  compute  0(x)  for  each  of  the  cubes  for  a  fixed  p.  Therefore  step 
(3)  takes  0(N )  work  in  total  to  compute  ip(x)  for  all  the  cubes. 

We  have  chosen  the  level  n  of  the  subdivision  so  that  the  number  of  par¬ 
ticles  in  the  near-field  of  each  atomic  cube  is  bounded  by  a  small  constant. 
Thus  in  step  (4),  for  each  of  particles  the  direct  computation  on  the  inter¬ 
action  with  its  near-field  takes  work  of  order  0(1).  Evaluation  of  the  local 


expansion  at  the  particle  position  takes  again  constant  amount  of  work  for  a 
fixed  p.  Thus,  step  (4)  takes  0(N)  work  to  compute  potentials  for  all  the  N 
particles. 

The  overall  complexity  of  the  algorithm  described  in  Figure  3.6,  therefore, 
is  the  sum  of  those  of  step  (1)  through  (4),  that  is,  order  0{N). 
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(1)  Initial  expansions:  for  each  atomic  cube  i,  compute  <£"(x). 

for  i  from  1  to  8n  with  step  1  do  begin 

compute  for  each  of  the  particles  in  the  cube  i  the  series 
expansion  for  the  potential  due  to  the  particle  by  using 
Theorem  3.2.1 ; 

sum  these  expansions  for  particles  in  the  cube  i  to  form 
<^>"(x)  for  the  cube  i. 
end. 

(2)  Upward-path:  for  each  of  the  cubes  i  at  level  l  of  the  spatial  refinement, 
compute  4>[(x)  in  an  upward  manner. 

for  /  from  n  —  1  to  0  with  step  —1  do  begin 
for  i  from  1  to  8;  with  step  1  do  begin 

shift  <f>I+l(x)  of  cube  i’ s  eight  subcubes  to  the  center  of 
the  cube  i  by  using  Lemma  3.2.1; 

Sum  the  shifted  d>i+1(x)  of  cube  i’s  eight  subcubes  to  form 
4>i(x)  for  the  cube  i. 
end. 
end. 

(3)  Downward-path:  for  each  of  the  cubes  i  at  level  /,  compute  i>\(x)  in  a 
downward  manner. 

for  l  from  1  to  n  with  step  1  do  begin 

for  i  from  1  to  8*  with  step  1  do  begin 

(3a)  shift  rpl~1(x)  of  cube  i’s  parent  cube  to  the  center 
of  the  cube  i,  by  using  Lemma  3.2.3; 

(3b)  shift  4>{x)  of  the  interactive-field  to  the  center  of 
cube  i,  by  using  Lemma  3.2.2. 
sum  the  resulting  expansions  of  (3a)  and  (3b)  to  form 
0,-(x)  for  the  cube  i. 
end. 
end. 


Figure  3.6:  A  sequential  algorithm 
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(4)  Final  evaluation:  For  each  of  particles,  compute  the  potential  at  the  particle 

for  particle  p  from  1  to  N  with  step  1  do  begin 

(4a)  evaluate  0"(x)  of  the  atomic  cube  i  the  particle  p  belongs 
to  at  the  particle  position; 

(4b)  compute  directly  interactions  with  particles  in  its  near¬ 
field  and  in  the  cube  i. 

add  (4a)  and  (4b)  as  the  desired  potential  for  the  particle  p. 
end. 


Figure  3.7:  A  sequential  algorithm  (con’t) 


Chapter  4 


Numerical  Verifications  of  the 
Three-dimensional  Algorithm 


This  chapter  numerically  verifies  the  three-dimensional  algorithm  of  Chap¬ 
ter  3  with  respect  to  the  achieved  accuracy  and  speed  of  the  algorithm.  We 
first  describe  a  sequential  implementation  of  the  algorithm.  We  discuss  our 
choice  of  a  3-D  tree  data  structure,  and  describe  heuristic  methods  for  in¬ 
creasing  the  efficiency  of  the  implementation  based  on  a  detailed  analysis  on 
the  time  complexity  of  the  algorithm.  We  then  present  experimental  results 
to  demonstrate  that  the  algorithm  does  have  linear  growth  and  does  compute 
the  forces  and  potentials  to  within  any  prespecified  tolerance  up  to  machine 
precision.  We  also  compare  the  speed  of  the  algorithm  with  that  of  a  direct 
computation  and  find  that  for  a  required  average  accuracy  of  10~4  for  poten¬ 
tial  fields  our  implementation  will  be  faster  when  there  are  more  than  1,000 
particles. 

4.1  A  Sequential  Implementation 

4.1.1  3-D  Tree  Data  Structure 

As  described  in  the  previous  chapter,  the  algorithm  requires  the  data  struc¬ 
ture  used  for  the  implementation  to  hierarchically  decompose  a  three- 
dimensional  space  and  to  support  operations  such  as  insertion,  deletion,  and 
searching.  Therefore  a  3-D  tree  data  structure  is  an  obvious  choice. 

A  3-D  tree  [knu81]  is  a  balanced  eightfold-branching  tree,  in  which  each 
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level  of  nodes  corresponds  to  one  level  of  the  three-dimensional  spatial  de¬ 
composition.  The  3-D  tree  is  an  extension  of  the  two  dimensional  tree  data 
structure  discussed  in  Section  2.2  of  Chapter  2.  In  the  2-D  tree,  a  node 
and  its  four  children  correspond  to  a  square  and  its  four  subsquares;  in  the 
3-D  tree,  a  node  and  its  eight  children  correspond  to  a  cube  and  its  eight 
subcubes.  Note  that  the  root  of  the  3-D  tree  corresponds  to  the  original 
cube  of  space  under  consideration,  while  the  leaves  of  the  tree  correspond 
to  atomic  cubes.  Since  each  node  of  the  tree  corresponds  to  a  cube  in  the 
spatial  decomposition,  we  regard  the  word  “node”  and  the  word  “cube”  as 
interchangeable.  From  now  on  we  will  refer  to  a  node  as  if  we  refer  to  the 
corresponding  cube  in  three-dimensional  space,  and  use  the  phrase  “the  cen¬ 
ter  of  a  node”  instead  of  “the  center  the  cube”.  Each  node  of  the  tree  has 
pointers  to  its  children,  and  pointers  to  its  near-field  and  its  interactive-field, 
as  defined  before.  The  near-field  and  the  interactive-field  for  each  node  could 
be  computed  at  run  time,  but  in  practice  this  is  too  expensive.  In  a  long¬ 
term  simulation,  which  iterates  over  many  time  steps,  these  fields  would  have 
to  be  recomputed  again  and  again.  Therefore,  we  initially  establish  static 
pointers  to  each  node’s  near-field  and  interactive- field.  This  approach  saves 
time  but  requires  extra  storage  space  for  these  pointers. 

Particles  are  initially  contained  in  leaf  nodes  of  the  tree  according  to  their 
positions.  After  each  iteration  of  a  simulation,  information  about  particles, 
such  as  positions,  velocities,  etc.  is  updated.  A  particle  is  moved  to  a  new 
node  if  it  crosses  boundaries  of  an  atomic  cube.  Coefficients  of  expansions 
at  various  stages  of  the  computation  are  stored  in  3-D  arrays  held  by  each 
node.  The  hierarchical  clustering  of  expansions  is  done  by  walking  up  and 
down  the  tree. 

The  level  of  spatial  refinement  is  chosen  approximately  as  log8  N .  Thus, 
the  3-D  tree  is  about  log8  N  levels  deep. 

4.1.2  Implementation  Details 

In  this  section  we  discuss  the  implementation  of  the  algorithm  on  a  Symbolics 
LISP  machine,  and  describe  techniques  for  increasing  the  efficiency  of  the 
implementation. 

Speed  is  the  major  concern  in  our  implementation.  One  way  to  measure 
the  efficiency  of  our  implementation  is  to  compare  it  with  an  implementation 
of  the  direct  computation.  For  the  first  few  implementations  of  our  algorithm 
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on  the  LISP  machine,  the  code  ran  extremely  slowly.  For  p  =  3,  the  crossover 
point  where  the  running  time  of  our  method  falls  below  the  running  time  of 
the  direct  method  was  at  about  10,000  particles. 

Performance  monitoring  revealed  opportunities  for  both  machine- 
independent  and  machine-dependent  optimizations. 

We  discuss  the  machine-independent  optimizations  first.  Shifting  expan¬ 
sions  due  to  interactive- field  potentials  is  very  expensive.  This  accounts 
for  most  of  the  hidden  constant  in  O(N).  For  each  node  in  the  tree,  we 
need  to  do  567  shiftings  on  expansions,  since  each  node  has  567  nodes  in 
its  interactive-field.  We  can  reduce  this  cost  by  grouping  together  nodes  in 
the  interactive-fields.  More  specifically,  we  replace  eight  child  nodes  by  their 
common  parent  node  (we  call  it  a  super-node)  if  all  eight  nodes  are  in  the 
interaction-field  of  a  single  node.  Using  this  heuristic,  each  node  has  only 
140  nodes  in  its  interactive- field.  Numerical  experiments  indicate  that  this 
modification  speeds  up  the  algorithm  by  a  factor  of  about  8. 

Another  source  of  inefficiency  in  the  initial  implementation  was  the  re¬ 
dundancy  in  computing  coefficients  in  shifting  formulas.  The  repeated  calcu¬ 
lation  of  factorials  and  permutations  was  removed  by  storing  factorials  and 
permutations  in  a  table.  There  is  also  repeated  calculation  of  some  com¬ 
mon  factors  in  the  formulas.  We  simplified  this  part  of  the  computation 
by  extracting  common  factors  in  sums,  and  by  canceling  common  factors 
appearing  in  successive  stages  of  the  computation. 

For  the  machine-dependent  optimizations,  we  found  that  a  major  portion 
of  the  running  time  is  spent  on  manipulating  coefficients  of  expansions  in 
shiftings  and  evaluations.  The  bottleneck  of  the  computation  is  the  floating¬ 
point  calculation  and  indexing  of  arrays  storing  coefficients  of  various  expan¬ 
sions.  The  array  reference  time  was  reduced  by  representing  a  3-D  array  as 
a  2-D  array  of  1-D  arrays,  or  as  1-D  array  of  2-D  arrays.  This  minimizes 
the  array  reference  overhead  since  1-D  and  2-D  arrays  are  better  supported 
on  the  LISP  machine1.  Because  of  the  extensive  use  of  array  reference,  the 
above  improvement  was  significant.  Other  machine-dependent  optimizations 
included  using  tight  loops  in  frequently  called  procedures,  and  avoiding  the 
creation  of  new  arrays  in  manipulating  expansions  whenever  possible. 

The  above  optimizations  greatly  reduced  both  the  time  and  storage  re¬ 
quirements  for  the  computation.  The  reduction  in  storage  in  turn  reduces 
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the  time  due  to  disk  paging.  Overall,  the  machine-independent  optimizations 
reduced  running  time  by  a  factor  of  8  and  the  machine- dependent  optimiza¬ 
tions  provided  an  additional  factor  of  4.  The  running  time  crossover  point 
with  the  direct  computation  is  now  at  about  1,000  for  p  =  3  (See  Table  4.4). 


Experimental  Verifications  of  the  Algo¬ 
rithm 


4.2.1  Accuracy  of  the  Algorithm 


In  this  section,  we  study  the  actual  achieved  accuracy  of  the  algorithm,  and 
compare  it  with  the  theoretical  prediction  given  in  Chapter  3.  We  first  test 
our  algorithm  on  the  Pythagorean  configuration  of  three  bodies  and  observe 
how  the  error  of  the  computation  scales  with  p.  Then  we  test  the  algorithm  on 
two  typical  distribution  models,  the  uniform  distribution  model  and  Plummer 
distribution  model,  each  with  1,000  particles,  and  again  determine  how  the 
error  varies  with  p. 


(a)  The  Pythagorean  Configuration 

The  Pythagorean  configuration  of  three  bodies  was  first  investigated  by 
C.  Burrau  in  1913  [SP67].  It  is  Pythagorean  not  only  in  the  geometric  sense 
but  also  with  respect  to  the  masses.  The  sides  of  the  triangle  formed  by  the 
three  bodies  are  3,  4,  and  5  and  the  masses  of  the  three  bodies  are  also  3, 
4,  and  5.  The  configuration  is  such  that  the  body  with  mass  x  is  situated 
at  the  vertex  opposite  to  the  side  of  length  x,  where  x  is  one  of  3,  4,  or  5. 
The  initial  configuration  of  the  problem  and  the  coordinate  system  used  are 
shown  in  Figure  4.1.  Initially  the  three  bodies  are  situated  in  the  z  —  0  plane 
and  have  speeds  zero,  consequently  the  motion  is  planar. 

We  use  our  algorithm  to  compute  the  accelerations  of  the  three  bodies 
and  observe  how  error  scales  with  p.  The  experimental  results  are  given  in 
Table  4.1  and  plotted  in  Figure  4.2.  We  define  the  relative  error  in  each  of 
the  accelerations  as 
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Figure  4.1:  Initial  configuration  of  the  Pythagorean  configuration  of  three  bodies 

where  c Calculated 's  the  calculated  acceleration  on  body  i  using  our  method  in 
single  precision  arithmetic,  and  ajrue  is  the  actual  acceleration  on  that  body 
using  the  direct  method  of  force  computation  in  double  precision  arithmetic 
and  is  therefore  accurate  to  the  machine  round-off  error  of  double  precision 
arithmetic. 

In  Table  4.1,  (max  is  the  maximum  error  of  all  the  relative  errors  in  ac¬ 
celerations 


e  —  Maxle^  e ^  \ 

cmar  —  M  •  •  •  >  Relative)  > 

eTm,  is  the  root-mean-square  error 


^rms 


(^reiative) 


where  n  =  3  is  the  number  of  bodies  in  the  test,  and  (.potential- energy 
relative  error  in  total  potential  energy. 


is  the 
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p 

£max 

^rma 

^■potential— energy 

1 

0.58 

0.46 

0.0082 

2 

0.14 

0.27 

0.0020 

3 

0.17 

0.11 

4.2  x  10-4 

4 

0.078 

0.049 

5.7  x  10“4 

5 

0.040 

0.025 

2.7  x  10-4 

6 

0.019 

0.012 

8.6  x  10~5 

7 

0.0086 

0.0055 

1.3  X  10“5 

8 

0.0039 

0.0025 

6.6  X  10“8 

9 

0.0019 

0.0012 

6.8  x  10“e 

10 

9.0  x  10'4 

5.7  x  10~4 

3.5  x  10“8 

11 

4.3  x  10“4 

2.8  x  10“4 

1.1  x  10“e 

12 

2.0  x  10“4 

1.3  x  10“4 

1.7  x  10“' 

13 

9.3  x  10“s 

6.0  x  10“5 

1.5  x  10~  ' 

14 

4.4  x  10“& 

2.9  x  10“s 

9.4  x  10-** 

15 

2.1  x  10“5 

1.4  x  10“5 

5.5  x  10"8 

16 

1.0  x  io-& 

“6.6  x  10“e 

5.7  x  10“8 

17 

5.0  x  10“e 

3.2  x  10*6 

5.5  x  10"8 

18 

2.3  x  10"° 

1.5  x  10“8 

2.0  x  10“8 

19 

1.1  x  10~° 

7.8  x  10-  ' 

1.7  x  10“8 

20 

5.5  x  10-* 

4.0  x  10~* 

3.6  x  10-8 

Table  4.1:  Accuracy  test  for  the  Pythagorean  configuration  of  three  bodies,  p  is 
the  highest  degree  of  harmonics  retained  in  an  expansion. 


Table  4.1  shows  that  the  accuracy  of  the  algorithm  is  improved  by  ap¬ 
proximately  a  factor  of  2  when  p  is  incremented  by  1  for  p  greater  than  2. 
Consequently,  the  dots  in  Figure  4.2  are  distributed  nearly  on  a  line,  since 
the  error  axis  is  scaled  logarithmically. 

Note  that  the  near-field  in  this  implementation  has  been  defined  so  that 
the  ratio  appearing  in  the  error  bounds  (3.31)  and  (3.36)  of  Chapter  3  is 
1/2.  The  factor  of  2  decrease  in  errors  for  each  increase  in  p  is  thus  expected 
from  the  formal  analysis.  The  results  presented  in  Table  4.1  and  plotted  in 
Figure  4.2  match  well  with  the  predicted  error  bounds. 


The  experimental  results  exhibit  two  phenomena  that  warrant  further 
discussion.  First,  the  error  t^tative  in  individual  accelerations  does  not  de- 
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Figure  4.2:  Plot  of  the  accuracy  test  for  the  Pythagorean  configuration  of  three 
bodies,  p  is  the  highest  degree  of  harmonics  retained  in  an  expansion. 

crease  monotonically  as  p  increases.  Second,  the  calculated  accelerations  have 
nonzero  components  in  the  z-direction,  which  is  perpendicular  to  the  plane 
of  the  three-body  configuration.  Below,  we  discuss  how  these  phenomena 
arise. 

Nonmonotonic  decrease  of  t^]lative  with  increasing  p 

The  nonmonotonic  decrease  of  e^lative  for  individual  accelerations  with 
increasing  p  is  due  to  the  fact  that  a  multipole  expansion  is  a  mixed-sign 
Taylor  series.  The  error  due  to  truncating  a  multipole  expansion  is  also  a 
mixed-sign  series.  Consequently,  the  error  from  truncating  the  multipole 
expansion  does  not  necessarily  decrease  monotonically  when  retaining  more 
terms  in  the  multipole  expansion.  Nevertheless,  the  error  in  individual  ac¬ 
celerations  is  always  bounded  by  emax  in  Table  4.1,  which,  like  the  predicted 
error  bounds,  decreases  monotonically  as  p  increases. 

Nonzero  z  component  for  accelerations 

The  nonzero  z  component  in  our  computation  results  from  two  kinds  of 
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truncation:  the  primary  truncation  and  the  secondary  truncation.  We  note 
that  each  term  of  the  multipole  expansion  in  Lemma  3.2.2  of  Chapter  3 
is  approximated  by  a  convergent  Taylor  series,  which  we  will  call  term  se¬ 
ries.  In  computing  the  multipole  expansion  using  Lemma  3.2.2,  the  primary 
truncation  truncates  the  multipole  expansion  and  the  secondary  truncation 
truncates  each  term  series  . 

In  the  multipole  expansion  computed  from  Lemma  3.2.2,  the  coefficients 
in  each  dimension  are  determined  by  parameters  in  all  three  dimensions.  If  an 
infinite  number  of  terms  were  retained  in  computing  both  the  term  series  and 
the  multipole  expansion,  then  the  z  component  of  the  multipole  expansion 
would  be  zero  due  to  cancellation  among  nonzero  z  terms.  Since  only  a  finite 
number  of  terms  are  retained,  the  z  component  of  the  multipole  expansion 
is  not  completely  canceled. 

Intuitively,  we  might  expect  that  retaining  more  and  more  terms  in  the 
term  series  would  reduce  in  general  the  secondary  truncation  error  in  com¬ 
puting  the  multipole  expansion  and  in  particular  the  error  in  ^-direction.  But 
the  experimental  results  show  that  the  intuition  is  unfounded.  The  reason  is 
that  the  multipole  expansion  and  the  term  series  are  mixed-sign  series.  The 
fact  that  each  of  the  terms  in  the  multipole  expansion  is  better  approximated 
by  the  term  series  does  not  necessarily  guarantee  that  the  resulting  truncated 
multipole  expansion  is  more  accurate. 

(b)  Uniform  Distribution  of  1,000  Particles 

We  verify  the  accuracy  claim  of  our  algorithm  using  a  configuration  in 
which  particles  axe  distributed  uniformly.  In  this  test,  the  system  has  1,000 
particles,  each  of  which  has  mass  1/1000.  The  results  are  shown  in  Table  4.2. 
The  errors  emax,  emax,  and  e potential-energy  are  defined  as  before.  The  results 
match  well  with  the  prediction. 

(c)  Plummer  Distribution  of  1,000  Particles 

We  also  test  our  algorithm  to  determine  if  the  accuracy  of  the  algorithm 
is  sensitive  to  the  distribution  of  particles.  The  Plummer  distribution  is  a 
nonuniform  distribution  with  the  density  profile  [Her86] 
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p 

ax 

{■rms 

^■potential— energy 

1 

0.56 

0.064 

5.0  x  10~4 

2 

0.11 

0.016 

1.6  x  10"5 

3 

0.095 

0.0065 

4.5  x  10“tt 

4 

0.024 

0.0028 

1.1  x  10"' 

5 

0.016 

0.0014 

1.9  x  10"7 

hr 

0.0083 

7.0  x  10"4 

2.9  x  lO"7 

7 

0.0026 

3.3  x  10-4 

2.7  x  10"' 

8 

0.0017 

1.9  x  lO"4 

1.3  x  10“7 

9 

6.9  x  10~4 

9.7  x  10“6 

3.2  x  lO"8 

Table  4.2:  Accuracy  test  for  the  uniform  model,  p  is  the  highest  degree  of 
harmonics  retained  in  an  expansion. 


which  is  spherical  and  isotropic.  M  is  the  mass  of  the  system  and  r0  is  the 
scale-length. 

In  the  test  the  initial  configuration  of  the  system  has  two  clusters,  each  of 
which  is  a  Plummer  system  with  500  particles.  Two  clusters  have  the  same 
M  and  r0  and  are  separated  by  5r0.  A  two-dimensional  projection  of  this 
configuration  is  shown  in  Figure  4.3. 

The  experimental  results  are  given  in  Table  4.3.  Again,  they  match  very 
well  with  the  theoretical  prediction. 

In  summary,  the  accuracy  of  the  algorithm  is  demonstrated  in  Table  4.1, 
Table  4.2,  and  Table  4.3.  These  indicate  that  our  method  can  compute  forces 
and  potentials  to  within  any  prespecified  tolerance  up  to  the  machine  round¬ 
off  precision,  which  is  about  7  decimal  digits  in  single  precision  arithmetic 
[Ref85].  As  we  retain  more  and  more  terms  in  the  multipole  expansions, 
the  accuracy  of  our  method  improves,  and  when  p  =  20  the  error  of  the 
computation  is  at  the  level  of  the  machine  round-off  errors. 

4.2.2  Running  Time  of  the  Algorithm 

In  this  section,  we  test  the  speed  of  our  algorithm  and  experimentally 
determine  how  the  running  time  grows  with  the  number  of  particles.  We  also 
compare  the  results  with  those  of  a  direct  computation  and  determine  the 
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Figure  4.3:  Initial  configuration  of  two  Plummer  systems 


^max 

€rm» 

^potential— energy 

0.56 

0.076 

3.3  x  10~4 

0.13 

0.019 

1.8  x  10-& 

0.032 

0.0053 

7.7  x  10"6 

0.014 

0.0018 

1.2  x  lO'* 

0.0066 

7.3  x  10"4 

7.5  x  10"7 

0.0030 

3.3  x  lO”4 

2.6  x  10~' 

0.0013 

1.6  x  10"4 

6.4  x  10~8 

5.9  x  lTF*- 

7.9  x  lO"* 

2.1  x  10"8 

2.8  x  10"4 

4.0  x  10~5 

2.1  x  10"s 

Table  4.3:  Accuracy  test  for  the  Plummer  model,  p  is  the  highest  degree  of 
harmonics  retained  in  an  expansion. 
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running  time  crossover  point.  In  the  test,  particles  are  initially  distributed 
uniformly  within  a  spherical  region.  The  space  under  consideration  is  the 
cube  of  space  containing  the  spherical  region. 

The  results  of  the  test  on  the  speed  of  the  algorithm  are  given  in  Table  4.4, 
and  are  plotted  in  Figure  4.4.  The  running  time  growth  rate  of  the  direct 
computation  is  also  plotted  for  comparison.  In  the  plot,  both  the  running¬ 
time  axis  and  the  number-of-particles  axis  are  scaled  logarithmically.  The 
parameter  p  of  the  algorithm  is  chosen  as  3  and  the  calculated  potentials  are 
accurate  in  average  to  about  10  “4. 

Figure  4.4  shows  that  the  running  time  of  our  sequential  implementation 
of  the  algorithm  grows  linearly  with  the  number  of  particles.  The  jumps  in 
the  curve  when  the  number  of  particles  are  500  and  4,000  are  due  to  the 
overhead  of  maintaining  the  tree  structure  when  the  tree  grows  one  level 
deeper.  The  height  of  the  tree  grows  logarithmically  with  the  number  of 
particles  to  enforce  the  constraint  that  the  number  of  particles  in  each  atomic 
cube  is  bounded  by  a  predetermined  constant. 

The  running  time  crossover  point  of  our  algorithm  with  the  direct  com¬ 
putation  is  at  about  1,000  particles.  This  means  that  for  a  required  average 
accuracy  of  10~4  for  the  potentials,  our  sequential  algorithm  is  faster  than 
direct  computation  when  there  are  more  than  1,000  particles. 


4.3  Discussion 

We  have  numerically  verified  the  three-dimensionai  algorithm  with  respect 
to  the  achieved  accuracy  and  speed  of  the  algorithm  in  the  previous  sections. 
In  this  section,  we  discuss  some  additional  issues  concerning  the  use  of  our 
algorithm. 

(a)  Tree  Overhead 

We  note  the  overhead  of  maintaining  the  3-D  tree  data  structure  in  Sec¬ 
tion  4.2.2.  The  tree  has  a  complicated  pattern  of  links  for  the  near-fields  and 
the  interactive-fields  at  each  of  the  nodes.  Much  of  the  computation  time  is 
spent  on  tracing  these  pointers.  When  the  tree  is  not  fully  loaded,  this  part 
of  the  overhead  will  dominate.  This  accounts  for  the  jumps  in  the  curve  in 
Figure  4.4. 
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(b)  Distribution  of  Particles 

We  have  assumed  in  Chapter  3  that  the  distribution  of  particles  is  rel¬ 
atively  uniform  within  the  region  under  consideration.  For  very  clumpy 
distributions,  the  performance  of  the  algorithm  will  degenerate.  If  we  still 
construct  the  decomposition  tree  with  height  of  |_log8  N\ ,  step  (4)  of  the 
algorithm  would  take  more  than  O(N)  time  to  compute  all  the  near-f.dd 
interactions.  Otherwise,  to  maintain  the  condition  that  the  number  of  parti¬ 
cles  in  each  atomic  cube  is  bounded  by  a  predetermined  constant,  we  would 
have  to  construct  the  tree  with  height  of  0(N )  in  the  worst  case.  Then 
step  (2)  and  step  (3)  of  the  algorithm  in  Figure  3.7  of  Chapter  3  would  take 
time  exponential  in  N  to  compute  all  the  multipole  expansions.  One  way  to 
improve  the  efficiency  is  to  dynamically  prune  empty  tree  branches,  that  is, 
if  no  particles  are  founded  in  a  branch,  the  branch  will  be  pruned  from  the 
tree. 

(c)  Choice  of  p 

We  have  demonstrated  in  Section  4.2.1  that  the  accuracy  of  our  method 
improves  as  we  retain  more  and  more  terms  in  the  multipole  expansions. 
How  big  should  the  parameter  p  be  in  practice?  This  question  depends  on 
the  specific  nature  of  a  problem.  The  goal  of  a  numerical  simulation  is  not 
always  “accuracy”  in  a  strictly  mathematical  sense,  but  rather  “fidelity”  to 
the  underlying  physics  in  a  sense  that  is  looser  and  more  pragmatic  [Pe86]. 
Often,  as  in  galaxy  simulations,  the  conserved  system  quantities  such  as  total 
energy  and  total  momentum  are  of  more  interest,  so  monitoring  on  these 
quantities  will  better  reflect  how  good  an  algorithm  is.  Sometimes  we  only 
want  to  look  at  the  statistical  behavior  of  a  system.  In  such  context,  some 
types  of  errors  are  much  more  tolerable  than  others.  Another  reason  is  that 
the  error  introduced  by  the  discrete  integration  time  step  of  a  simulation  is 
itself  comparable  with  that  of  truncation  [App85].  Therefore  choosing  small 
p  often  suffices.  This  will  greatly  reduce  the  constant  factor  in  our  algorithm. 
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Chapter  5 

A  Parallel  Algorithm 


5.1  Introduction 

It  is  clear  that  the  A-body  method  of  Chapter  3  can  take  significant  ad¬ 
vantage  of  parallelism.  With  a  massively  parallel  computer,  where  we  can 
allocate  a  separate  processing  element  for  each  particle,  we  can  compute  from 
expansions  the  potentials  for  all  of  the  particles  in  one  parallel  step.  We  can 
also  propagate  values  up  and  down  the  spatial-decomposition  tree,  as  in  steps 
(2)  and  (3)  of  the  algorithm  in  Figure  3.6,  in  parallel,  generating  expansions 
for  all  the  nodes  at  a  given  level  in  one  parallel  step.  The  depth  of  the  tree 
grows  as  log  A,  so  a  complete  propagation  will  require  time  at  least  on  the 
order  of  log  A. 

In  this  chapter,  we  present  an  extremely  fast  implementation  of  the  A- 
body  code — a  parallel  implementation  of  the  algorithm,  whose  measured  run¬ 
ning  time  does  indeed  grow  as  0(log  A).  The  code  runs  on  the  Connection 
Machine,  which  is  a  massively  parallel  SIMD  computer,  consisting  of  up  to 
65,536  processors,  each  with  its  own  local  memory,  connected  in  a  commu¬ 
nications  network.  On  the  Connection  Machine  model  CM-2  that  we  have 
been  using,  each  processor  is  a  one-bit  ALU  with  floating-point  unit  and  64K 
bits  of  local  memory. 

In  designing  a  practical  algorithm  for  a  large  parallel  machine,  there  are 
two  issues  that  must  be  addressed  in  addition  to  the  issue  of  deciding  which 
steps  should  be  performed  concurrently.  For  many  large  parallel  computa¬ 
tions,  local  computations  at  each  processing  element  are  relatively  cheap, 
while  communications  among  processing  elements  are  expensive.  The  bot- 
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tleneck  of  the  computation  is  the  interprocessor  communication.  Thus,  an 
efficient  program  must  be  designed  to  minimize  both  the  need  for  interpro¬ 
cessor  communication  and  the  contention  generated  by  whatever  communi¬ 
cation  is  required.  For  the  parallel  jV-body  code,  we  exploit  the  locality  of 
the  algorithm  to  reduce  the  need  for  communication.  In  addition  we  exploit 
the  regularity  in  the  communication  patterns  of  the  algorithm  in  order  to 
substantially  reduce  contention  for  the  underlying  communication  resource. 


5.2  Why  a  Parallel  Computer? 

In  the  tree  walking  of  our  sequential  algorithm,  computations  on  expansions 
could  be  done  concurrently  within  each  level  of  the  tree.  Tremendous  speed¬ 
up  can  be  expected  if  we  exploit  this  parallelism.  The  question  is  how  to  map 
our  problem  onto  a  parallel  machine,  either  a  machine  specially  designed 
for  this  problem  or  a  commercially  available  one,  such  as  the  Connection 
Machine. 

Let  us  take  a  close  look  at  the  sequential  algorithm  we  developed  in  Chap¬ 
ter  3.  First,  the  tree  computation  has  regularity  over  all  nodes  at  each  level  of 
the  tree.  A  SIMD  machine  suffices  for  computation  of  this  pattern.  Second, 
the  tree  computation  has  concurrency.  Computation  at  each  of  the  nodes 
within  a  single  level  of  the  tree  can  proceed  at  the  same  time.  Third,  the 
tree  computation  has  spatial  locality.  In  the  upward-path  of  tree  walking, 
each  of  the  nodes  talks  only  to  its  parent  node.  In  the  downward-path  of 
tree  walking,  each  of  the  nodes  talks  not  only  vertically  to  its  parent,  but 
also  laterally  to  those  in  its  interactive- field.  The  lateral  communication  with 
a  node’s  interactive-field  is  local  to  a  subtree  rooted  the  node’s  predecessor 
four  levels  up  in  the  tree.  Depending  on  how  we  construct  the  tree  on  a 
parallel  machine,  we  might  be  able  to  exploit  this  locality.  Lastly,  the  tree 
computation  has  dependencies  among  levels  of  the  tree.  More  specifically, 
computation  at  one  level  of  the  tree  cannot  proceed  unless  the  computation 
one  level  up  or  down  is  finished.  This  determines  that  the  optimal  perfor¬ 
mance  we  can  achieve  from  the  tree  computation  is  proportional  to  the  height 
of  the  tree,  that  is,  0(log  N),  where  N  is  the  number  of  leaves  in  the  tree. 

We  choose  the  Connection  Machine  to  implement  a  parallel  version  of  the 
algorithm  developed  in  Chapter  3. 
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5.3  3-D  Tree  on  the  Parallel  Computer 

We  use  a  3-D  tree  as  our  data  structure,  as  in  the  sequential  implementation. 
The  goal  of  the  algorithm  design  is  to  distribute  the  components  of  the  3-D 
tree  structure  among  many  processors  so  that  the  concurrency  in  the  tree 
computation  can  be  maximumly  exploited. 

We  allocate  a  processor  to  each  node  of  the  tree.  The  upward-path  and 
downward-path  of  the  tree  walking  of  the  algorithm  require  that  each  node 
in  the  tree  has  links  to  its  parent  as  well  as  to  its  children.  Explicitly  storing 
these  links  would  be  expensive,  since  each  node  has  one  parent  and  eight 
children.  However,  since  the  tree  has  a  fixed  branching  factor,  a  regular 
allocation  of  processors  for  the  tree  will  impose  an  implicit  relationship  among 
the  processor  addresses,  as  long  as  the  tree  is  allocated  within  a  well-defined 
region  of  processors.  This  scheme  is  called  the  address-induced  representation 
of  the  tree  structure  [Hil85]  [Chr84]. 

More  precisely,  we  allocate  consecutive  processors  to  nodes,  starting  at 
the  root  of  the  tree.  We  first  make  a  sequential  breadth-first  scan  of  the  tree 
to  be  constructed,  and  label  each  node  as  we  encounter  it,  starting  from  0 
at  the  root.  Then  we  assign  to  every  node  a  processor  (or  a  fixed  number 
of  processors)  whose  address  is  the  breadth-first  scan  label  of  the  node.  We 
therefore  have  a  formula  to  induce  the  parent-child  relationship  for  every 
processor  in  the  tree.  Given  the  address  i  of  a  processor,  we  know  that  its 
parent  processor  has  address  of  [(*  —  1  )/8j ,  and  its  child  processors  have 
addresses  from  8i  + 1  to  8i  -f  8.  By  representing  the  tree  in  this  way,  we  avoid 
having  to  store  pointers  at  every  tree  node.  However,  the  saving  in  storage 
is  achieved  at  the  cost  of  computing  the  tree  structure  every  time  we  trace 
pointers. 

We  also  allocate  a  processor  to  each  of  the  particles  in  a  test  x.  Each  leaf 
node  of  the  tree  remembers  addresses  of  those  particles  that  belong  to  the 
node  spatially,  so  that  those  particles  can  be  accessed  simply  by  referring 
to  the  leaf  node  during  the  computation  of  its  near-field.  Similarly  those 
particles  also  remember  the  leaf  node  for  the  initial  expansion  stage  of  the 
computation.  This  representation  of  particle-leaf-node  links  has  an  additional 
advantage.  After  every  time  step  of  the  simulation,  the  tree  may  be  updated 
for  particles  that  move  to  new  cells.  This  process  is  easy,  since  only  particle- 
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leaf-node  links  need  to  be  updated. 


5.4  A  Parallel  Algorithm 

In  this  section,  we  will  describe  a  parallel  implementation  of  the  algorithm 
we  developed  in  Chapter  3.  The  basic  patterns  of  communication  include 
passing  and  combining  data  among  processors  vertically  and  laterally  with 
respect  to  the  embedded  tree  structure.  The  description  of  the  algorithm  is 
summarized  in  Figure  5.1.  Notations  are  used  in  the  same  way  as  that  in  the 
sequential  algorithm  of  Figure  3.6.  The  number  of  levels  in  the  tree  is  n. 

In  Step  (1)  of  the  algorithm,  each  node  of  the  tree  is  mapped  to  a  proces¬ 
sor.  Since  the  tree  only  needs  to  be  built  once,  the  sequential  construction  is 
not  a  serious  drawback  as  long  as  we  amortize  its  cost  over  many  time  frames 
in  a  simulation. 

Step  (2)  finds  for  every  node  of  the  tree  its  near-field  and  interactive- 
field  nodes.  This  could  be  done  at  run  time,  since  the  addresses  for  these 
nodes  take  a  fair  amount  of  memory  spaces.  But  this  means  that  we  have 
to  find  these  nodes  every  time  we  want  to  access  them.  We  know  that  there 
are  81  nodes  in  a  node’s  near-field,  and  567  nodes  (using  heuristics  this  can 
be  reduced  to  140  nodes)  in  each  interactive-field  (we  ignore  the  boundary 
conditions).  Experiments  have  shown  that  the  calculation  of  these  fields  is 
very  expensive.  So  we  precompute  these  fields  and  store  them.  Step  (3) 
inserts  all  the  particles  into  the  tree. 

Step  (4)  initially  expands  potentials  for  all  particles  and  forms  the  ex¬ 
pansions  $  for  leaf  nodes  of  the  tree.  An  expansion  for  the  potential  of  a 
particle  is  valid  outside  the  leaf  node  the  particle  belongs  to,  and  has  the 
center  of  the  node  as  the  reference  point.  All  expansions  of  a  single  leaf  node 
are  added  together.  This  is  done  all  in  parallel.  (3>’s  and  tP's  are  expansions 
as  defined  in  Section  3.3,  and  are  represented  as  arrays  of  parallel  variables.) 

Step  (5)  implements  the  upward  path  of  the  tree  walking,  which  computes 
$  for  every  node  of  the  tree.  The  computation  proceeds  in  this  way:  nodes  at 
one  level  of  the  tree  in  parallel  shift  the  $’s  to  the  centers  of  their  respective 
parents,  combine  the  shifted  $’s  according  to  whether  they  have  the  same 
parents,  and  send  the  results  to  the  parents. 

Step  (6)  walks  down  the  tree  and  computes  ^’s  for  every  node  of  the  tree. 
Every  node  in  parallel  fetches  $  at  its  parent  node,  and  shifts  it  to  its  center. 
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(1)  Tree  embedding:  allocate  one  processor  for  every  tree  node,  starting  at  the 
root. 

allocate  processor  0  for  the  root; 
for  i  from  1  to  n  do  begin 

for  j  from  1  to  8*  do  begin 

allocate  processor  8(8*-1  —  l)/7 + j  for  the  jth  to  the 
leftmost  node  at  level  i  of  the  tree, 
end. 
end. 

(2)  Find  for  each  of  nodes  its  near-field  and  interactive-field  nodes: 

for  all  nodes  of  the  tree  in  parallel  do  begin 

each  node  finds  its  near-field  and  interactive-field  nodes 
and  stores  the  addresses, 
end. 

(3)  Insert  N  particles: 

for  i  from  1  to  TV  do  begin 

insert  particle  i  into  the  tree, 
end. 

(4)  Initial  expansion  at  leaf  nodes: 

for  all  particles  in  parallel  do  begin 

compute  expansion  $  for  the  potential  due  to  each  of  particles 
relative  to  center  of  the  leaf  node  the  particle  belongs  to; 
add  together  those  $’s  due  to  particles  of  the  same  leaf  node 
and  form  4>  for  that  leaf  node, 
end. 


Figure  5.1:  A  parallel  algorithm 
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(5)  Upward-path  of  tree  walking: 

for  t  from  n  down  to  1  do  begin 

for  all  nodes  at  level  i  in  parallel  do  begin 
at  each  node 

shift  the  $  to  the  center  of  its  parent  node; 
sum  the  resulted  expansions  of  those  who  have  the 
same  parent  node  and  form  $  for  that  parent  node. 

end. 

end. 

(6)  Downward-path  of  tree  walking: 

for  i  from  1  to  n  do  begin 

for  all  nodes  at  level  i  in  parallel  do  begin 
at  each  node  dl 

(6a)  shift  $  at  dl’s  parent  node  to  dl’s  center; 

(6b)  for  node  d2  in  di’s  interactive-field  do  begin 
shift  $  at  d2  to  dl’s  center; 
end. 

add  the  shifted  $’s  together, 
add  (6a)  and  (6b)  to  form  ®  for  node  dl. 

end. 

end. 

(7)  Final  evaluation:  compute  near-field  and  far-field  interactions  for  each  of 
particles. 

for  all  particles  in  parallel  do  begin 
at  each  particle 

(7a)  evaluate  at  the  particle  position  the  $  of  the  leaf 
node  the  particle  belongs  to; 

(7b)  compute  directly  potentials  due  to  its  near-field, 
add  (7a)  and  (7b)  to  form  the  desired  potential. 

end. 


Figure  5.2:  A  parallel  algorithm  (con't) 
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This  is  done  level  by  level,  as  in  Step  (5). 

Finally,  step  (7)  finishes  the  computation.  Every  paxticle  evaluates  in 
parallel  the  corresponding  'P’s  at  the  particle  position.  This  part  of  the 
potential  is  due  to  the  interaction  with  its  far-field.  Then  every  particle 
computes  in  parallel  the  near-field  interactions  directly.  The  sum  of  the 
far-field  and  near-field  interactions  is  the  desired  potential  at  this  particle 
position. 

We  have  seen  from  the  above  description  that  computation  at  a  given 
level  of  the  tree  in  steps  (4)-(7)  takes  one  parallel  step.  Thus,  step  (4) 
and  step  (7)  each  take  constant  time.  Since  the  tree  has  depth  of  logAL 
and  computations  at  different  levels  of  the  tree  have  dependencies  among 
them,  a  complete  propagation  in  step  (5)  and  step  (6)  takes  O(log  N)  time. 
Therefore,  the  time  complexity  of  the  parallel  algorithm,  excluding  the  initial 
set-up  time  for  the  tree  and  communication  costs,  is  O(logiV). 

Let  us  now  look  at  the  memory  usage  in  the  computation.  When  we 
retain  in  an  expansion  only  the  terms  of  spherical  harmonics  with  degree  less 
than  four,  the  total  number  of  terms  in  an  expansion  is  20.  As  verified  by  the 
experimental  results  from  the  sequential  implementation,  this  guarantees  in 
average  a  four  decimal-digit  accuracy  in  potentials.  Each  of  the  coefficients 
in  an  expansion  is  a  32-bit  single  precision  floating-point  number.  We  have 
three  different  arrays  of  coefficients  for  each  node  of  the  tree  in  the  computa¬ 
tion.  Thus  expansions  alone  take  about  2K  bits  of  memory  at  each  of  nodes. 
Storing  the  interactive-field  and  near-field  takes  another  4K  bits  of  memory. 
We  also  need  some  stack  space  for  intermediate  computation.  Therefore, 
about  10K  bits  of  local  memory  at  each  node  suffice  for  our  purpose.  In  case 
a  machine  has  insufficient  amount  of  local  memory  per  processor,  we  can 
remedy  this  by  simulating  a  node  of  the  tree  with  several  processors.  In  such 
a  case,  however,  the  cost  of  the  communication  overhead  would  be  high. 


5.5  Experimental  Results 

We  have  implemented  the  parallel  algorithm  on  the  Connection  Machine. 
The  experimental  results  are  summarized  in  Table  5.1.  The  running  time 
growth  rate  of  the  implementation  is  plotted  in  Figure  5.3,  along  with  that 
of  the  sequential  implementation.  The  results  experimentally  verify  that  the 
parallel  implementation  of  the  algorithm  on  the  Connection  Machine  scales 
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Number  of 

Running  Time  (sec.) 

Particles 

Lisp  Machine 

Connection  Machine 

Speed-up 

64 

12.60 

9.06 

1.4 

128 

22.73 

14.99 

1.5 

256 

51.38 

23.72 

2.2 

512 

409.95 

33.86 

12 

1024 

620.92 

51.78 

12 

2048 

1050.65 

60.65 

17 

4096 

5221.29 

71.83 

73 

8192 

7203.37 

79.63 

90 

16384 

11411.68 

94.17 

121 

Figure  5.3:  Running  time  growth  rate  of  the  parallel  algorithm,  with  that  of  the 
sequential  implementation. 
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logarithmically  in  the  total  number  of  particles  in  the  simulation,  even  when 
we  take  the  communication  into  the  consideration.  We  find  that  the  parallel 
algorithm  runs  faster  than  the  sequential  algorithm,  even  on  small  exam¬ 
ples.  Compared  with  the  sequential  implementation,  the  parallel  algorithm 
exhibits  a  speed-up  factor  of  about  10  for  N= 1,000  and  a  speed-up  factor  of 
about  100  for  iV=10,0002. 

The  complexity  issue  of  the  parallel  algorithm,  which  is  complicated  by 
the  need  for  communication,  will  be  discussed  in  more  detail  in  the  following 


section. 


5.6  Communication  Patterns 

We  have  discussed  the  complexity  issue  due  to  the  floating-point  compu¬ 
tation  in  Section  5.4.  Our  experiments  have  shown  that  the  floating-point 
computation  takes  about  20%  of  the  total  running  time  and  the  interproces¬ 
sor  communication  consumes  a  major  portion  of  the  rest3. 

A  careful  study  at  the  communication  patterns  reveals  that  there  are 
two  major  kinds  of  communication  going  on  in  the  computation.  The  first 
kind  of  communication  occurs  between  adjacent  levels  of  the  tree.  We  call 
this  vertical  communication  .  The  second  kind  of  communication  occurs 
betweeu  nodes  and  their  interactive-fields  and  near-fields,  and  takes  place 
within  a  single  level  of  the  tree.  We  call  this  lateral  communication* .  The 
experimental  results  show  that,  when  the  tree  is  averagely  loaded,  the  lateral 
communication  constitutes  about  20%  of  the  total  running  time,  and  the 
vertical  communication  accounts  for  another  25%.  In  the  following  sections, 
we  describe  various  optimizations  we  have  made  to  reduce  the  communication 
overhead. 

2The  parallel  computation  has  about  200  million  floating-point  calculations  in  a  test  of 
1,000  particles,  and  has  about  1,600  million  floating-point  calculations  in  a  test  of  10,000 
particles. 

3The  results  are  obtained  on  the  Connection  Machine  using  the  metering  program 
provided  by  Shawn  Mclean. 

4Recall  from  Section  4.1.2  that  a  super-node  is  a  node  formed  by  grouping  on 
interactive-field  nodes.  Communicating  with  a  super-node  in  the  lateral  communica¬ 
tion  involves  two  adjacent  levels  of  the  tree.  Although  this  situation  is  similar  to  the 
vertical  communication,  for  the  sake  of  simplicity  we  will  consider  it  to  be  the  lateral 
communication. 
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5.6.1  Reducing  Communication  Bottlenecks 

The  lateral  communication  required  for  computing  interactions  with  near¬ 
fields  and  interactive-fields  is  one  of  the  most  expensive  parts  of  the  parallel 
computation.  The  cost  of  the  lateral  communication  is  due  largely  to  mes¬ 
sage  collisions  in  which  many  processors  try  to  access  a  single  processor  at 
the  same  time.  Since  processor  access  on  the  Connection  Machine  is  handled 
in  an  exclusive  reading  and  exclusive  writing  way,  the  time  of  a  successful 
message  routing  is  proportional  to  the  maximum  number  of  message  colli¬ 
sions. 

We  tried  to  ease  the  communication  bottleneck  by  reducing  the  number 
of  collisions  in  the  lateral  communication.  We  know  that  each  node  has  140 
nodes  in  its  interactive-field,  and  those  nodes  are  represented  as  a  list  of 
addresses.  A  close  look  shows  that  many  processors  often  have  the  same 
node  as  their  interactive- field  node  at  the  same  time,  due  to  the  way  the 
interactive-field  fist  is  constructed.  Our  algorithm,  therefore,  randomizes 
the  order  in  which  one’s  interactive-field  nodes  nodes  are  accessed,  in  the 
hope  that  two  processors  axe  less  likely  to  access  a  single  processor  at  the 
same  time.  As  a  result,  the  randomization  improves  the  performance  of  the 
algorithm. 

5.6.2  Localizing  Communication 

In  computing  the  interaction  with  a  node’s  interactive-field,  we  need  to  use 
coefficients  of  expansions  stored  in  an  interactive-field  node  again  and  again. 
Instead  of  fetching  these  every  time  we  use  them,  we  fetch  them  once  and 
store  them  on  a  temporary  local  stack.  This  can  reduce  the  traffic  by  a  factor 
of  5,  when  computing  expansions  whose  highest  degree  is  three. 

5.6.3  Combining  and  Delegating  Messages 

The  Connection  Machine  router  has  fairly  good  performance  when  the  net¬ 
work  is  averagely  loaded,  but  the  routing  time  scales  up  as  the  number  of 
collisions  in  the  routing  process.  Collisions  are  mostly  due  to  concurrent  read 
or  concurrent  write  to  a  single  processor.  To  reduce  the  collisions,  we  can  use 
the  scan  operation,  which  does  the  segmented  parallel  prefix  computation  on 
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the  Connection  Machine5.  To  resolve  collision  due  to  the  concurrent  read, 
we  first  do  an  exclusive  read,  and  then  do  a  scan-copy  operation.  In  case  of 
the  concurrent  write  combined  with  addition,  we  do  a  scan--\-  and  then  an 
exclusive  write  instead.  If  elements  of  a  segment  to  be  scanned  are  not  in 
consecutive  processors,  in  order  to  apply  the  scan  operator  we  have  to  first 
project  the  elements  to  a  new  segment  so  that  the  new  segment  has  elements 
in  consecutive  processors.  This  extra  projection  pays  off  when  there  are  more 
than  thirty  processors  in  a  segment. 


Figure  5.4:  Concurrent  write 

In  our  implementation  of  the  algorithm,  the  final  direct  interactions  with 
a  node’s  near-field  often  gives  rise  to  10-20  collisions.  Therefore  we  use 
a  different  method  to  reduce  the  collisions.  The  idea  is  to  structure  the 
message  routing  through  a  tree  that  distributes  the  message  collisions  over 
many  intermediate  nodes.  This  tree-like  routing  allows  many  of  the  collisions 
to  be  handled  in  parallel,  thereby  reducing  the  total  time  delay  due  to  the 
collisions. 

Consider  an  example  in  which  16  processors  access  a  single  processor  for 
concurrent  write  (Figure  5.4).  Using  the  tree  routing  scheme  outlined  above, 

5Given  a  binary  associative  operator  *  and  a  segment  of  sequence  of  elements 
xi,  xi, ...,  xm,  the  parallel  prefix  operation  computes  x\,  arj  *  x2>  •••!  Xi  *  x2  *  ■  ■  ■  xm 
in  0(log(m))  time  [HJ86].  In  particular,  there  are  scon-)-  and  scan-copy  operations.  The 
scan  operation  is  very  efficiently  implemented  on  the  Connection  Machine.  For  simple 
binary  operations  such  as  +  and  copy,  it  takes  the  same  order  of  time  as  a  routing  cycle 
does,  which  is  the  fundamental  unit  for  time  measurement,  and  is  therefore  considered  to 
take  constant  amount  of  time. 
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we  can  use  a  two-level  tree  to  combine  every  four  of  the  writes  together,  and 
then  combine  every  four  of  the  resulting  writes  (Figure  5.5).  Even  though 
there  are  20  total  collisions  in  this  tree,  the  time  delay  is  only  proportional 
to  8  collisions  because  the  nodes  at  the  intermediate  level  of  the  tree  handle 
their  collisions  in  parallel.  This  process  is  called  message-combining.  The 
dual  process,  message-delegating,  reduces  collisions  due  to  concurrent  read, 
as  illustrated  in  Figure  5.6. 

In  general,  suppose  there  are  N  collisions  in  total.  Each  stage  of  combin¬ 
ing  or  delegating  reduces  the  number  of  messages  by  a  factor  of  m.  Therefore 
the  number  of  stages  is  logm  N.  What  is  the  optimal  value  of  m  such  that 
the  function 


Routmg-time(m,  N)  =  m\ogm(N) 

is  minimal?  Using  elementary  calculus,  we  find  that  the  optimal  m  =  e, 
where  e  is  the  base  of  the  natural  logarithm  2.718...  Of  course,  we  must 
actually  choose  an  integral  value  for  m  (2  or  3). 

Message- combining  and  message-delegating  have  been  very  effective,  and 
in  practice  have  given  a  factor  of  2  improvement  in  speed. 


5.7  Discussion 

In  this  section,  we  discuss  trade-offs  we  made  in  our  parallel  implementation, 
and  suggest  alternatives  to  this  implementation. 

5.7.1  Grid  Representation 

We  have  used  an  address-induced  scheme  to  construct  the  tree  on  the  Con¬ 
nection  Machine.  This  mapping  scheme  has  the  merit  of  simplicity.  It  also 
saves  memory  space,  by  avoiding  explicitly  storing  pointers  in  the  tree.  As 
a  result  of  this  consecutive  mapping  scheme,  only  a  subset  of  processors  are 
active  at  a  time,  since  the  computation  proceeds  level  by  level  in  the  tree. 

The  alternative  to  the  address-induced  scheme  is  to  use  the  Connection 
Machine  grid-like  communication  network  called  the  NEWS  grid.  This  struc¬ 
ture  provides  fast  communication  for  local  or  highly  structured  communica¬ 
tion  patterns  [Hil85].  Every  processor  is  assigned  a  grid  coordinate  and  can 
be  addressed  by  specifying  this  coordinate.  The  tree  can  be  embedded  in  the 


CHAPTER  5.  A  PARALLEL  ALGORITHM 


59 


network  in  two  different  ways.  One  way  is  to  superimpose  levels  of  the  tree 
so  that  nodes  at  one  level  may  share  processors  with  those  at  other  levels,  as 
illustrated  in  Figure  5.7.  As  a  result,  some  processors  require  more  memory 
than  others  do.  This  scheme  makes  the  processor  utilization  at  most  100%. 
However,  since  all  processors  on  the  Connection  Machine  must  have  the  same 
amount  of  local  memory,  the  nonuniform  memory  requirement  across  proces¬ 
sors  wastes  memory  in  some  processors.  The  alternative  is  to  unfold  levels 
of  the  tree  and  to  assign  to  every  node  a  processor,  as  in  Figure  5.8.  This 
makes  the  memory  utilization  at  most  100%  but  wastes  some  processors. 

By  using  the  NEWS  grid,  we  can  take  advantage  of  the  fast  grid  routing 
for  highly  structured  communication  patterns,  such  as  the  communication 
with  interactive-fields  discussed  below. 

5.7.2  Exploiting  Regularities  in  Communication  Pat¬ 
terns 

To  further  reduce  the  number  of  collisions  discussed  in  Section  5.6.1,  Alan 
Ruttenberg  has  suggested  a  scheme  to  exploit  the  regularities  in  the  commu¬ 
nication  patterns.  We  have  not  implemented  this,  so  we  do  not  know  how 
much  it  decreases  the  running  time  of  the  algorithm.  To  keep  the  discussion 
simple,  we  present  the  scheme  in  two  dimensions.  The  extension  to  three 
dimensions  is  straightforward. 

We  classify  squares  of  the  spatial  decomposition  into  four  classes.  Suppose 
all  the  squares  of  smallest  size  of  Figure  5.9  are  at  level  l.  We  first  group 
every  four  squares  that  share  a  common  parent  at  level  /  —  1.  Then  we  classify 
the  four  squares  of  a  group  as  class  1,  2  ,3,  and  4  respectively  according  to 
the  relative  position  of  a  square  in  the  group,  as  labeled  in  the  Figure  5.9. 

We  find  that  for  every  group  there  is  spatial  symmetry  among  the 
interactive-fields  of  its  four  squares.  In  Figure  5.10,  square  a'  is  an  interactive- 
field  square  of  square  a.  A  clockwise  rotation  of  90  degrees  of  the  square 
a ’  with  respect  to  the  point  O  results  in  another  square  6',  which  is  an 
interactive-field  square  of  square  b,  and  so  on.  As  a  result  of  the  spatial 
symmetry,  the  four  squares  a1,  6',  c',  and  d!  are  in  different  classes,  as  shown 
in  Figure  5.10.  This  symmetry  property  can  be  used  to  completely  eliminate 
message  collisions  in  the  communication  with  interactive-field  squares.  The 
idea  is  as  follows:  each  of  four  squares  a,  6,  c,  and  d  interacts  with  squares 
a',  b',  c',  and  d'  respectively  at  the  same  time  so  that  the  communication 
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pattern  is  symmetric  with  respect  to  four  squares;  the  same  is  true  for  all 
other  groups.  Since  the  four  squares  a' ,  b',  d ,  and  d!  are  all  in  different  classes 
and  communication  patterns  of  two  different  groups  are  identical  except  that 
they  are  shifted  from  each  other  by  some  distance,  none  of  the  four  squares 
a d,  c7,  and  d!  will  be  fetched  by  more  than  one  square  at  a  time.  Thus, 
lateral  communication  can  take  place  without  collisions. 

In  Section  4.1.2,  we  discussed  super  nodes  in  the  interactive-fields.  Since 
the  super  nodes  are  one  level  higher  in  the  decomposition  tree  than  those 
ordinary  nodes  in  the  interactive-field,  interactions  with  a  super  node  using 
the  above  scheme  will  give  rise  to  collisions.  However,  the  maximum  number 
of  interactions  to  a  super  node  will  not  exceed  4  at  a  time.  Therefore  the 
collisions  can  be  avoided  by  spreading  the  data  to  be  fetched  at  a  super  node 
to  its  four  children,  and  afterwards,  access  to  the  super  node  will  be  directed 
to  the  four  children. 

In  the  above  discussion  we  also  ignored  the  boundary  case.  We  can  intro¬ 
duce  dummy  squares  outside  the  boundaries  so  that  boundary  squares  still 
preserve  the  symmetry  property  of  interactive-fields. 


5.8  Conclusion 

We  have  described  the  parallel  implementation  of  the  three-dimensional  N- 
body  algorithm  that  runs  on  the  Connection  Machine  in  order  0( log  N )  time 
in  this  chapter.  Compared  with  the  previous  ./V-body  methods,  our  method 
has  a  demonstrated  advantage  in  simulating  large  number  of  particles.  The 
parallel  implementation  achieves  tremendous  speedup  in  running  time  and 
computes  the  forces  and  potentials  to  within  any  prespecified  tolerance  up 
to  machine  precision. 

Combining  the  results  of  this  chapter  with  the  analytical  results  derived 
in  Chapter  3  and  the  experimental  results  of  the  sequential  implementation  in 
Chapter  4,  we  have  presented  a  complete  analysis  of  the  three-dimensional 
jV-body  algorithm  both  theoretically  and  experimentally.  Because  of  the 
superior  speed  and  accuracy  of  our  algorithm,  we  expect  that  it  will  find 
applications  in  astrophysics,  plasma  physics,  fluid  dynamics,  and  molecular 
dynamics.  Nevertheless,  there  are  additional  improvements  that  we  believe 
will  enhance  the  practicality  of  the  algorithm  and,  therefore,  are  worth  fur¬ 
ther  research  effort. 
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We  have  observed  in  Section  5.6  that  local  computation — mainly  floating¬ 
point  calculation  at  each  processing  element  of  the  parallel  implementation — 
takes  only  about  20%  of  total  running  time.  The  rest  of  the  running  time 
is  mostly  spent  on  communication  among  processing  elements.  The  results 
indicate  that  the  bottleneck  is  the  interprocessor  communication.  Therefore, 
an  efficient  implementation  of  the  algorithm  should  explore  the  communica¬ 
tion  requirements  of  the  algorithm. 

In  Section  5.7  we  described  alternatives  to  the  current  parallel  implemen¬ 
tation  to  improve  the  performance  of  the  algorithm.  By  using  the  NEWS 
grid,  we  can  exploit  the  localities  of  the  algorithm  and  take  advantage  of  the 
fast  grid  routing.  The  use  of  the  NEWS  grid  will  also  enable  us  to  exploit  the 
regularities  in  the  lateral  communication  of  the  algorithm.  We  expect  that 
the  resulting  implementation  will  ease  the  bottleneck  of  the  interprocessor 
communication. 
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